Shadow Octopus Operations

Octopus is operated through CLI commands and the source registry scheduler.

It does not run as a public API service.

Configuration

Local example:

data_root = "data/octopus"
sources_dir = "sources"
request_timeout = 30
verify_tls = true

Production convention:

data_root = "/dev/data1/shadow-lakehouse/octopus"
sources_dir = "sources"

On the server, set:

export SHADOW_OCTOPUS_CONFIG=/dev/data1/shadow-octopus/config.prod.toml
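A small helper can make the choice of config file explicit. The sketch below is an assumption, not part of the repository: `resolve_config` and `octopus` are hypothetical function names, and the fallback to `config.example.toml` simply mirrors the local examples in this document. It passes the path through `--config`, so it works whether or not the CLI also reads `SHADOW_OCTOPUS_CONFIG` itself.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; neither function ships with shadow-octopus.

# Print the config path: SHADOW_OCTOPUS_CONFIG if set, else the local example.
# Fails early when the file does not exist.
resolve_config() {
  local path="${SHADOW_OCTOPUS_CONFIG:-config.example.toml}"
  if [ ! -f "$path" ]; then
    echo "config not found: $path" >&2
    return 1
  fi
  printf '%s\n' "$path"
}

# Run any shadow-octopus subcommand with the resolved config.
octopus() {
  local config
  config="$(resolve_config)" || return 1
  uv run shadow-octopus --config "$config" "$@"
}

# Example: octopus root-status
```

With this in place, the same command line works locally and on the server; only the environment variable changes.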

Scheduler

Run the scheduler:

uv run shadow-octopus --config config.example.toml run-scheduler \
  --max-workers 4

Run one cycle for selected sources:

uv run shadow-octopus --config config.example.toml run-scheduler \
  --source cls_telegraph \
  --source sina7x24 \
  --max-workers 2 \
  --once

Production wrapper:

scripts/run_scheduler.sh --max-workers 4

The wrapper writes scheduler logs and task logs under the ops directory and uses a scheduler-level lock to avoid running two schedulers at the same time.
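The contents of `scripts/run_scheduler.sh` are not shown here, so the following is only an illustration of what a scheduler-level lock can look like. It uses an atomic `mkdir` as the lock, which is one common technique and not necessarily the one the wrapper uses. `SCHEDULER_LOCK_DIR` and `with_scheduler_lock` are hypothetical names.

```shell
#!/usr/bin/env bash
# Sketch of a scheduler-level lock; the real wrapper may work differently.

# Hypothetical lock location; override with SCHEDULER_LOCK_DIR.
SCHEDULER_LOCK_DIR="${SCHEDULER_LOCK_DIR:-${TMPDIR:-/tmp}/shadow-octopus-scheduler.lock}"

# Run a command while holding the lock; refuse to start if it is already held.
with_scheduler_lock() {
  # mkdir is atomic: only one caller can create the directory.
  if ! mkdir "$SCHEDULER_LOCK_DIR" 2>/dev/null; then
    echo "scheduler lock is held: $SCHEDULER_LOCK_DIR" >&2
    return 1
  fi
  local status=0
  "$@" || status=$?
  # Release the lock once the wrapped command has finished.
  rmdir "$SCHEDULER_LOCK_DIR"
  return "$status"
}

# Example:
# with_scheduler_lock uv run shadow-octopus --config config.example.toml \
#   run-scheduler --max-workers 4
```

A directory lock like this survives a hard kill of the scheduler, so a stale lock has to be removed by hand before the next start.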

Source Status

Check one source:

uv run shadow-octopus --config config.example.toml source-status \
  --source cninfo_announcements

Check all initialized sources:

uv run shadow-octopus --config config.example.toml root-status

Status output includes:

raw record counts
manifest counts
pending/resolved/failed object counts
local object bytes and file count
latest file mtimes
latest checkpoint
recent run records

Verify Before Lighthouse Ingest

Verify the source before ingesting it:

uv run shadow-octopus --config config.example.toml verify-source \
  --source cninfo_announcements \
  --require-data

Once verification passes, rebuild the Lighthouse read-side indexes from the Octopus raw workspace.
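The verification step can be scripted across several sources before an ingest. This sketch assumes that `verify-source` exits non-zero when verification fails, which this document does not state. `octopus` and `verify_all` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; only the CLI subcommand and flags come from this document.

# Run a shadow-octopus subcommand with the configured config file.
octopus() {
  uv run shadow-octopus --config "${SHADOW_OCTOPUS_CONFIG:-config.example.toml}" "$@"
}

# Verify every source given as an argument.
# Assumption: verify-source returns a non-zero exit code on failure.
verify_all() {
  local failed=0 source
  for source in "$@"; do
    if octopus verify-source --source "$source" --require-data; then
      echo "ok: $source"
    else
      echo "FAILED: $source" >&2
      failed=1
    fi
  done
  # Non-zero if any source failed, so a caller can skip the ingest.
  return "$failed"
}

# Example: verify_all cninfo_announcements cls_telegraph || exit 1
```

All sources are checked even after a failure, so a single run reports every source that needs attention.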

CNInfo Object Queue

For high-volume CNInfo objects, use the source-local object download queue:

uv run shadow-octopus --config config.example.toml rebuild-object-download-queue \
  --source cninfo_announcements

Then download a bounded batch:

uv run shadow-octopus --config config.example.toml download-objects \
  --source cninfo_announcements \
  --limit 50

The downloader first selects ready rows from state.db. If no queue exists, it can fall back to manifest scanning.
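The two commands above can be combined into a fixed number of bounded batches. This is a sketch under assumptions: `octopus` and `download_in_batches` are hypothetical helper names, and the loop runs a fixed batch count because this document does not say how `download-objects` signals an empty queue.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; only the CLI subcommands and flags come from this document.

# Run a shadow-octopus subcommand with the configured config file.
octopus() {
  uv run shadow-octopus --config "${SHADOW_OCTOPUS_CONFIG:-config.example.toml}" "$@"
}

# Usage: download_in_batches <source> <batches> <limit>
# Rebuilds the queue once, then downloads <batches> batches of <limit> objects.
download_in_batches() {
  local source="$1" batches="$2" limit="$3" i=1
  octopus rebuild-object-download-queue --source "$source" || return 1
  while [ "$i" -le "$batches" ]; do
    echo "batch $i of $batches (limit $limit)"
    # Stop at the first failing batch instead of hammering the source.
    octopus download-objects --source "$source" --limit "$limit" || return 1
    i=$((i + 1))
  done
}

# Example: download_in_batches cninfo_announcements 10 50
```

Keeping each batch small bounds the work lost if a run is interrupted; `source-status` shows the pending, resolved, and failed object counts between runs.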

Upload And Extract Lakehouse Data

Upload local raw data as shard archives:

scripts/upload_lakehouse_to_remote.sh

Extract on the server after checksum verification:

ssh aliyun 'SHADOW_REMOTE_ZSTD_BIN=/dev/data1/bin/zstd \
  /dev/data1/shadow-octopus/scripts/extract_lakehouse_shards_remote.sh \
  /dev/data1/shadow-lakehouse/.staging/<batch>'

The extractor verifies shard checksums and zstd frames before moving source directories into:

/dev/data1/shadow-lakehouse/octopus
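The extractor script itself is not reproduced here. As a rough illustration of the two checks it performs, the sketch below verifies a batch directory by hand. The file layout is an assumption: a `SHA256SUMS` manifest next to shards named `*.tar.zst`. The real shard naming and checksum algorithm may differ, and `verify_batch` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Illustrative only; not the real extract_lakehouse_shards_remote.sh.

# Usage: verify_batch <batch-dir>
# Assumed layout: <batch-dir>/SHA256SUMS plus <batch-dir>/*.tar.zst shards.
verify_batch() {
  local batch="$1" zstd_bin="${SHADOW_REMOTE_ZSTD_BIN:-zstd}" shard
  # Checksums first: every file listed in the manifest must match.
  (cd "$batch" && sha256sum -c --quiet SHA256SUMS) || return 1
  # Then test each zstd archive for integrity without extracting it.
  for shard in "$batch"/*.tar.zst; do
    [ -e "$shard" ] || continue
    "$zstd_bin" -t -q "$shard" || return 1
  done
}

# Example: verify_batch /path/to/staging/batch && echo "batch is intact"
```

Running both checks before any directory is moved means a corrupt upload never reaches the raw workspace.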

Handoff To Lighthouse

After Octopus writes or migrates raw data, Lighthouse ingests from:

/dev/data1/shadow-lakehouse/octopus/<source>

and writes read-side indexes under:

/dev/data1/shadow-lakehouse/lighthouse/<source>

Keep this handoff one-way: Octopus writes raw data, Lighthouse reads raw data and serves queries.
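The directory layout above can be captured in a few helpers, for example to check that raw data exists before starting an ingest. This is a sketch: `LAKEHOUSE_ROOT` and the function names are hypothetical, and only the two path patterns come from this document.

```shell
#!/usr/bin/env bash
# Hypothetical helpers built on the path convention described above.

LAKEHOUSE_ROOT="${LAKEHOUSE_ROOT:-/dev/data1/shadow-lakehouse}"

# Raw workspace that Octopus writes for a source.
octopus_raw_dir() { printf '%s/octopus/%s\n' "$LAKEHOUSE_ROOT" "$1"; }

# Read-side index location that Lighthouse writes for a source.
lighthouse_index_dir() { printf '%s/lighthouse/%s\n' "$LAKEHOUSE_ROOT" "$1"; }

# Succeeds only when the raw directory exists and is not empty.
ready_for_handoff() {
  local raw
  raw="$(octopus_raw_dir "$1")"
  [ -d "$raw" ] && [ -n "$(ls -A "$raw")" ]
}

# Example:
# ready_for_handoff cninfo_announcements && echo "raw data present"
```

The helpers only read from the Octopus side, which keeps them consistent with the one-way handoff.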