
Shadow Octopus Operations

Octopus is operated through CLI commands and the source registry scheduler.

It does not run as a public API service.

Configuration

Local example:

data_root = "data/octopus"
sources_dir = "sources"
request_timeout = 30
verify_tls = true

Production convention:

data_root = "/dev/data1/shadow-lakehouse/octopus"
sources_dir = "sources"

On the server, set:

export SHADOW_OCTOPUS_CONFIG=/dev/data1/shadow-octopus/config.prod.toml

Scheduler

Run the scheduler:

uv run shadow-octopus --config config.example.toml run-scheduler \
  --max-workers 4

Run one cycle for selected sources:

uv run shadow-octopus --config config.example.toml run-scheduler \
  --source cls_telegraph \
  --source sina7x24 \
  --max-workers 2 \
  --once

Production wrapper:

scripts/run_scheduler.sh --max-workers 4

The wrapper writes scheduler and task logs under the ops directory and holds a scheduler-level lock so that two schedulers cannot run at the same time.
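As an illustration of scheduler-level locking, here is a minimal sketch that relies on the atomicity of mkdir. It is not taken from scripts/run_scheduler.sh, which may use a different mechanism. The LOCK_DIR path, the function names, and the exit code 75 are all hypothetical.

```shell
# Sketch only: the real wrapper may lock differently.
# LOCK_DIR is a hypothetical location; mkdir either creates it or fails,
# so only one caller can hold the lock at a time.
LOCK_DIR="${LOCK_DIR:-${TMPDIR:-/tmp}/shadow-octopus-scheduler.lock}"

acquire_scheduler_lock() {
  mkdir "$LOCK_DIR" 2>/dev/null
}

release_scheduler_lock() {
  rmdir "$LOCK_DIR" 2>/dev/null
}

# Run the given command while holding the lock.
# Returns 75 (an arbitrary non-zero code) if the lock is already held,
# otherwise the exit status of the command.
run_with_scheduler_lock() {
  if ! acquire_scheduler_lock; then
    echo "scheduler already running (lock: $LOCK_DIR)" >&2
    return 75
  fi
  rc=0
  "$@" || rc=$?
  release_scheduler_lock
  return "$rc"
}

# Example:
#   run_with_scheduler_lock uv run shadow-octopus --config config.example.toml \
#     run-scheduler --max-workers 4
```

A directory lock like this is left behind if the process is killed before it releases the lock, so a real wrapper would also need a way to detect and clear stale locks.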

Source Status

Check one source:

uv run shadow-octopus --config config.example.toml source-status \
  --source cninfo_announcements

Check all initialized sources:

uv run shadow-octopus --config config.example.toml root-status

Status output includes:

  • raw record counts
  • manifest counts
  • pending/resolved/failed object counts
  • local object bytes and file count
  • latest file mtimes
  • latest checkpoint
  • recent run records

Verify Before Lighthouse Ingest

uv run shadow-octopus --config config.example.toml verify-source \
  --source cninfo_announcements \
  --require-data

Once verification passes, rebuild the Lighthouse read-side indexes from the Octopus raw workspace.
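To verify several sources before a rebuild, a small wrapper can run verify-source for each one and report which failed. This sketch assumes that verify-source exits non-zero when verification fails, which is not confirmed here. The octopus and verify_sources function names are hypothetical, as is using SHADOW_OCTOPUS_CONFIG as the fallback for --config.

```shell
# Hypothetical convenience wrapper around the CLI invocation used in this document.
octopus() {
  uv run shadow-octopus --config "${SHADOW_OCTOPUS_CONFIG:-config.example.toml}" "$@"
}

# Verify every source given as an argument.
# Assumption: verify-source signals failure through a non-zero exit status.
# Returns 1 if any source fails, after checking all of them.
verify_sources() {
  failed=""
  for src in "$@"; do
    if ! octopus verify-source --source "$src" --require-data; then
      failed="$failed $src"
    fi
  done
  if [ -n "$failed" ]; then
    echo "verification failed for:$failed" >&2
    return 1
  fi
  return 0
}

# Example:
#   verify_sources cninfo_announcements cls_telegraph && echo "safe to rebuild indexes"
```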

CNInfo Object Queue

For high-volume CNInfo objects, use the source-local object download queue:

uv run shadow-octopus --config config.example.toml rebuild-object-download-queue \
  --source cninfo_announcements

Then download a bounded batch:

uv run shadow-octopus --config config.example.toml download-objects \
  --source cninfo_announcements \
  --limit 50

The downloader first selects ready rows from state.db. If no queue exists, it can fall back to manifest scanning.
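To work through a large queue in bounded steps, the download command can be repeated a fixed number of times. The sketch below assumes that download-objects exits non-zero on error. The octopus and download_batches names are hypothetical, and the batch count is the caller's choice because this document does not describe how the command reports an empty queue.

```shell
# Hypothetical convenience wrapper around the CLI invocation used in this document.
octopus() {
  uv run shadow-octopus --config "${SHADOW_OCTOPUS_CONFIG:-config.example.toml}" "$@"
}

# download_batches <source> <batches> <limit>
# Runs download-objects <batches> times with --limit <limit>,
# stopping at the first batch that fails (assumed: non-zero exit on error).
download_batches() {
  source_name=$1
  batches=$2
  limit=$3
  i=0
  while [ "$i" -lt "$batches" ]; do
    octopus download-objects --source "$source_name" --limit "$limit" || return 1
    i=$((i + 1))
  done
  return 0
}

# Example: up to 10 batches of 50 objects each.
#   download_batches cninfo_announcements 10 50
```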

Upload And Extract Lakehouse Data

Upload local raw data as shard archives:

scripts/upload_lakehouse_to_remote.sh

Extract on the server after checksum verification:

ssh aliyun 'SHADOW_REMOTE_ZSTD_BIN=/dev/data1/bin/zstd \
  /dev/data1/shadow-octopus/scripts/extract_lakehouse_shards_remote.sh \
  /dev/data1/shadow-lakehouse/.staging/<batch>'

The extractor verifies shard checksums and zstd frames before moving source directories into:

/dev/data1/shadow-lakehouse/octopus
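The two checks described above can be sketched as follows. This is not the code of extract_lakehouse_shards_remote.sh. The use of SHA-256, the SHA256SUMS manifest name, the .zst shard suffix, and the verify_batch function name are all assumptions. Only SHADOW_REMOTE_ZSTD_BIN comes from this document.

```shell
# verify_batch <batch_dir>
# Sketch only. Assumes the batch directory contains a SHA256SUMS manifest
# (hypothetical name) and shard archives ending in .zst (hypothetical suffix).
verify_batch() {
  batch_dir=$1
  zstd_bin=${SHADOW_REMOTE_ZSTD_BIN:-zstd}
  (
    cd "$batch_dir" || exit 1
    # 1. Compare each file against the checksum recorded in the manifest.
    sha256sum -c SHA256SUMS >/dev/null || exit 1
    # 2. Test the integrity of each zstd archive without extracting it.
    for shard in ./*.zst; do
      [ -e "$shard" ] || continue
      "$zstd_bin" -t -q "$shard" || exit 1
    done
  )
}

# Example:
#   verify_batch "/dev/data1/shadow-lakehouse/.staging/<batch>" && echo "batch verified"
```

Running both checks before any directory is moved means a corrupt or truncated shard is rejected while the existing data under the target directory is still untouched.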

Handoff To Lighthouse

After Octopus writes or migrates raw data, Lighthouse ingests from:

/dev/data1/shadow-lakehouse/octopus/<source>

and writes read-side indexes under:

/dev/data1/shadow-lakehouse/lighthouse/<source>

Keep this handoff one-way: Octopus writes raw data, Lighthouse reads raw data and serves queries.