Shadow Octopus Supported Sources
Source definitions are TOML files under sources/.
The current production sources are enabled in the root sources/ directory. Example source kinds live under sources/examples/ and are disabled by default.
Production Sources
| Source | source_kind | Remote source | Current scheduler role | Output |
|---|---|---|---|---|
| cninfo_announcements | cninfo | CNInfo company announcements and historical announcement search | Historical backfill plus object download | Announcement records, pending/resolved PDF manifests, PDF objects |
| cls_telegraph | cls_telegraph | CLS Telegraph (财联社电报) roll list | Latest news sync | Telegraph news raw records |
| sina7x24 | sina7x24 | Sina Finance (新浪财经) 7x24 feed | Latest news sync | 7x24 news raw records |
CNInfo Announcements
Configured file:
sources/cninfo_announcements.toml
Current acquisition paths:
| Path | Command | Purpose |
|---|---|---|
| Company page sync | sync-cninfo | Newest-first company announcement pages |
| Historical search backfill | backfill-cninfo-history | Reverse date/page backfill from CNInfo historical search |
| Object download | download-objects | Download pending PDF objects into objects/sha256/... |
| Queue rebuild | rebuild-object-download-queue | Rebuild the source-local object download queue from manifests |
| Legacy import | import-shadow-pdf | Import old shadow-pdf CNInfo metadata and cached PDFs |
Current scheduler settings:
| Setting | Value |
|---|---|
| enabled | true |
| mode | history |
| streams | ["a_share_all"] |
| start_date | 2000-01-01 |
| max_pages | 10 |
| history_interval_seconds | 600 |
| download_limit | 50 |
| download_interval_seconds | 600 |
| request_delay | 1.0 second |
CNInfo latest sync and historical backfill both append to the same raw contract. They are intentionally separate acquisition commands so historical fill and latest refresh can be scheduled independently.
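One way to picture independent scheduling is a per-task interval check, sketched below. This is not the project's scheduler: `due_tasks` and the task names are hypothetical, and only the two 600-second values come from the CNInfo settings listed earlier.

```python
def due_tasks(
    now: float,
    last_run: dict[str, float],
    intervals: dict[str, float],
) -> list[str]:
    """Return the names of tasks whose interval has elapsed since their last run."""
    due = []
    for name, interval in intervals.items():
        previous = last_run.get(name)
        # A task with no recorded run is treated as due immediately.
        if previous is None or now - previous >= interval:
            due.append(name)
    return due


# Interval values taken from the CNInfo settings; the task names are illustrative.
INTERVALS = {"history_backfill": 600, "object_download": 600}
```

Because each task keeps its own last-run timestamp, delaying or disabling one of them does not shift the other.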
CLS Telegraph
Configured file:
sources/cls_telegraph.toml
Current scheduler settings:
| Setting | Value |
|---|---|
| enabled | true |
| interval_seconds | 300 |
| count | 20 |
| pages | 1 |
| request_delay | 1.0 second |
Command:
uv run shadow-octopus --config config.example.toml sync-cls-telegraph \
--source cls_telegraph \
--count 20 \
--pages 1
Sina 7x24
Configured file:
sources/sina7x24.toml
Current scheduler settings:
| Setting | Value |
|---|---|
| enabled | true |
| interval_seconds | 300 |
| count | 100 |
| pages | 1 |
| request_delay | 1.0 second |
Commands:
uv run shadow-octopus --config config.example.toml sync-sina7x24 \
--source sina7x24 \
--count 100 \
--pages 1
uv run shadow-octopus --config config.example.toml backfill-sina7x24 \
--source sina7x24 \
--pages 100
The backfill command pages backward from the current minimum Sina item id when no explicit cursor is provided.
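The backward paging described here can be sketched roughly as follows. This is an assumption-laden illustration, not the adapter's code: `fetch_page` is a hypothetical callable that returns items with an id below the given value, newest first, and each item is assumed to carry an integer `id` field.

```python
from typing import Callable, Optional

Item = dict  # assumed shape: each item carries an integer "id"


def backfill(
    fetch_page: Callable[[int], list[Item]],
    known_ids: set[int],
    pages: int,
    cursor: Optional[int] = None,
) -> list[Item]:
    """Walk backward through older items, starting just below `cursor`.

    When `cursor` is None, the walk starts from the smallest id already stored.
    """
    if cursor is None:
        if not known_ids:
            return []  # nothing stored yet, so there is no minimum to start from
        cursor = min(known_ids)
    collected: list[Item] = []
    for _ in range(pages):
        batch = fetch_page(cursor)
        if not batch:
            break  # no older items are available
        collected.extend(batch)
        # Move the cursor to the oldest id seen so the next page goes further back.
        cursor = min(item["id"] for item in batch)
    return collected
```

Deriving the default cursor from stored ids means a repeated run resumes where the previous one stopped, without the caller tracking a cursor.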
Reusable Source Kinds
These source kinds are supported by code and examples, but the example configs are disabled by default:
| Example source | source_kind | Config file | Use case |
|---|---|---|---|
| example_news_feed | feed | sources/examples/news_feed.toml | RSS/Atom feeds |
| example_json_news | json_api | sources/examples/json_news.toml | Simple JSON list APIs with configured paths and pagination |
| example_html_news | html_page | sources/examples/html_news.toml | Simple HTML listing pages parsed by configured patterns |
Generic source kinds are for simple sources. Prefer a dedicated adapter when a website needs signing, anti-abuse handling, unusual pagination, or source-specific normalization.
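To make "configured paths" concrete, here is a minimal sketch of how a generic JSON source might locate its item list from a dotted path. The dotted-path convention and the `extract_items` helper are assumptions for illustration; the actual json_api configuration keys are defined in the example config, not here.

```python
from typing import Any


def extract_items(payload: Any, items_path: str) -> list[Any]:
    """Follow a dotted path such as "data.list" into a decoded JSON payload.

    Returns an empty list when a segment is missing or the target is not a list.
    """
    node = payload
    for key in items_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return []
        node = node[key]
    return node if isinstance(node, list) else []
```

A source that needs more than this kind of lookup, such as request signing or response decryption, is the case where a dedicated adapter fits better.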
Manual Acquisition Paths
Octopus also supports manual or semi-manual ingestion:
| Command | Purpose |
|---|---|
| import-object | Import a local PDF, MP3, MP4, XLSX, or similar file into a source workspace |
| capture-url | Fetch one URL directly into the source-local object store |
| capture-url-list | Fetch a slow URL list from text or JSONL input |
| discover-links | Discover supported object links from an HTML page into pending manifests |
Use these for one-off datasets, manual media, external reports, and migration tasks that still need to land in the same Octopus raw contract.
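For a sense of what link discovery involves, the sketch below collects object links from an HTML page using only the standard library. It is not the discover-links implementation: the suffix list is an assumption based on the file types named for import-object, and `LinkCollector` and `discover_links` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumed suffix list, based on the file types named for import-object.
SUPPORTED_SUFFIXES = (".pdf", ".mp3", ".mp4", ".xlsx")


class LinkCollector(HTMLParser):
    """Collect absolute anchor targets whose URL path ends in a supported suffix."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        absolute = urljoin(self.base_url, href)
        # Check the URL path only, so a query string does not hide the suffix.
        if urlparse(absolute).path.lower().endswith(SUPPORTED_SUFFIXES):
            if absolute not in self.links:
                self.links.append(absolute)


def discover_links(html: str, base_url: str) -> list[str]:
    """Return supported object links in document order, without duplicates."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

Resolving every href against the page URL keeps relative links usable once they are written to a manifest.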