Shadow Octopus Operations
Octopus is operated through CLI commands and the source registry scheduler.
It does not run as a public API service.
Configuration
Local example:
data_root = "data/octopus"
sources_dir = "sources"
request_timeout = 30
verify_tls = true
Production convention:
data_root = "/dev/data1/shadow-lakehouse/octopus"
sources_dir = "sources"
On the server, set:
export SHADOW_OCTOPUS_CONFIG=/dev/data1/shadow-octopus/config.prod.toml
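The commands in this document pass `--config` explicitly, so a wrapper script may want to resolve the path itself. A minimal sketch, assuming the wrapper prefers `SHADOW_OCTOPUS_CONFIG` and falls back to the local example file; `resolve_octopus_config` is a hypothetical helper, not part of the CLI:

```shell
# Hypothetical helper: print the config path a wrapper should pass to --config.
# Prefers SHADOW_OCTOPUS_CONFIG, otherwise falls back to the local example file.
resolve_octopus_config() {
  printf '%s\n' "${SHADOW_OCTOPUS_CONFIG:-config.example.toml}"
}

# Example use (not executed here):
#   uv run shadow-octopus --config "$(resolve_octopus_config)" root-status
```

Resolving the path in one place keeps local and production invocations identical apart from the environment.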
Scheduler
Run the scheduler:
uv run shadow-octopus --config config.example.toml run-scheduler \
  --max-workers 4
Run one cycle for selected sources:
uv run shadow-octopus --config config.example.toml run-scheduler \
  --source cls_telegraph \
  --source sina7x24 \
  --max-workers 2 \
  --once
Production wrapper:
scripts/run_scheduler.sh --max-workers 4
The wrapper writes scheduler logs and task logs under the ops directory. It also takes a scheduler-level lock so that two schedulers cannot run at the same time.
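The wrapper's locking code is not shown here. One common way to get a scheduler-level lock in shell is `flock(1)` from util-linux; the sketch below assumes that approach, and both `run_with_scheduler_lock` and the lock file path are hypothetical:

```shell
# Sketch of a scheduler-level lock using flock(1) from util-linux.
# LOCK_FILE is a hypothetical path; the real wrapper may use a different one.
LOCK_FILE="${LOCK_FILE:-/tmp/shadow-octopus-scheduler.lock}"

run_with_scheduler_lock() {
  # The subshell keeps the lock file open on file descriptor 9 while "$@" runs.
  (
    # -n: fail immediately instead of waiting if another process holds the lock.
    if ! flock -n 9; then
      echo "another scheduler holds $LOCK_FILE" >&2
      exit 75  # EX_TEMPFAIL: try again later
    fi
    "$@"
  ) 9>"$LOCK_FILE"
}

# Example use (not executed here):
#   run_with_scheduler_lock uv run shadow-octopus \
#     --config "$SHADOW_OCTOPUS_CONFIG" run-scheduler --max-workers 4
```

Because the kernel releases the lock when the holding process exits, a crashed scheduler does not leave a stale lock behind, which is the main reason to prefer this over a PID file.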
Source Status
Check one source:
uv run shadow-octopus --config config.example.toml source-status \
  --source cninfo_announcements
Check all initialized sources:
uv run shadow-octopus --config config.example.toml root-status
Status output includes:
- raw record counts
- manifest counts
- pending/resolved/failed object counts
- local object bytes and file count
- latest file mtimes
- latest checkpoint
- recent run records
Verify Before Lighthouse Ingest
Verify a source before Lighthouse ingests it:
uv run shadow-octopus --config config.example.toml verify-source \
  --source cninfo_announcements \
  --require-data
Then rebuild Lighthouse read-side indexes from the Octopus raw workspace.
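When several sources are handed off together, verification can gate the rebuild. A minimal sketch, assuming `verify-source` exits non-zero when verification fails (not confirmed in this document); `octopus` and `verify_sources` are hypothetical helper names:

```shell
# Thin wrapper around the CLI; redefine it to dry-run the loop below.
octopus() {
  uv run shadow-octopus --config "${SHADOW_OCTOPUS_CONFIG:-config.example.toml}" "$@"
}

# Verify each named source in turn and stop at the first failure.
# Assumes verify-source exits non-zero when verification fails.
verify_sources() {
  for source_name in "$@"; do
    if ! octopus verify-source --source "$source_name" --require-data; then
      echo "verification failed for $source_name; skipping Lighthouse rebuild" >&2
      return 1
    fi
  done
}

# Example use (not executed here):
#   verify_sources cninfo_announcements cls_telegraph && echo "safe to rebuild indexes"
```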
CNInfo Object Queue
For high-volume CNInfo objects, use the source-local object download queue:
uv run shadow-octopus --config config.example.toml rebuild-object-download-queue \
  --source cninfo_announcements
Then download a bounded batch:
uv run shadow-octopus --config config.example.toml download-objects \
  --source cninfo_announcements \
  --limit 50
The downloader first selects ready rows from state.db. If no queue exists, it can fall back to manifest scanning.
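To work through a large queue, the bounded batch can be repeated a fixed number of times. A minimal sketch that uses only the `download-objects`, `--source`, and `--limit` arguments shown above; the `octopus` and `download_in_batches` helpers are hypothetical, and the loop assumes a non-zero exit status signals a failed batch:

```shell
# Thin wrapper around the CLI; redefine it to dry-run the loop below.
octopus() {
  uv run shadow-octopus --config "${SHADOW_OCTOPUS_CONFIG:-config.example.toml}" "$@"
}

# Run a fixed number of bounded download batches, stopping at the first failure.
# Arguments: <source> <batches> <limit-per-batch>
download_in_batches() {
  source_name="$1"
  batches="$2"
  limit="$3"
  i=1
  while [ "$i" -le "$batches" ]; do
    echo "batch $i/$batches" >&2
    octopus download-objects --source "$source_name" --limit "$limit" || return $?
    i=$((i + 1))
  done
}

# Example use (not executed here): 10 batches of 50 objects each.
#   download_in_batches cninfo_announcements 10 50
```

Keeping each batch small bounds the work lost if a run is interrupted; the total is capped at batches times limit.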
Upload And Extract Lakehouse Data
Upload local raw data as shard archives:
scripts/upload_lakehouse_to_remote.sh
Extract on the server after checksum verification:
ssh aliyun 'SHADOW_REMOTE_ZSTD_BIN=/dev/data1/bin/zstd \
  /dev/data1/shadow-octopus/scripts/extract_lakehouse_shards_remote.sh \
  /dev/data1/shadow-lakehouse/.staging/<batch>'
The extractor verifies shard checksums and zstd frames before moving source directories into:
/dev/data1/shadow-lakehouse/octopus
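The verify-before-move step can be illustrated with `sha256sum -c`. This is a sketch, not the extractor script itself: it assumes, hypothetically, that every shard has a sibling checksum file ending in `.sha256` in `sha256sum` format, and `verify_shards` is an illustrative name:

```shell
# Verify every checksum file in a staging directory before extracting anything.
# Assumes (hypothetically) that each shard has a sibling "<shard>.sha256" file
# in sha256sum format; the real extractor may lay out checksums differently.
verify_shards() {
  staging_dir="$1"
  for sum_file in "$staging_dir"/*.sha256; do
    # An unmatched glob stays literal, so a missing file means no checksums at all.
    [ -e "$sum_file" ] || { echo "no checksum files in $staging_dir" >&2; return 1; }
    # sha256sum -c resolves the listed file names relative to the current directory.
    (cd "$staging_dir" && sha256sum -c "$(basename "$sum_file")" >/dev/null) || return 1
  done
}

# Example use (not executed here):
#   verify_shards /dev/data1/shadow-lakehouse/.staging/<batch> && echo "checksums ok"
```

A frame-level check could follow the same pattern with `zstd -t <shard>`, which tests compressed data integrity without writing output.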
Handoff To Lighthouse
After Octopus writes or migrates raw data, Lighthouse ingests from:
/dev/data1/shadow-lakehouse/octopus/<source>
and writes read-side indexes under:
/dev/data1/shadow-lakehouse/lighthouse/<source>
Keep this handoff one-way: Octopus writes raw data, Lighthouse reads raw data and serves queries.