Shadow Octopus Stable Design
This page records the parts of Octopus that should stay stable unless there is a deliberate data-contract migration.
No Public API
Octopus is not a serving layer.
It exposes CLI commands and scheduler workers for acquisition, but it does not expose a public HTTP API or SDK query surface. Consumers should read through shadow-lighthouse, not directly from Octopus process memory.
The durable interface is the filesystem raw contract under:
octopus/<source_name>/
Source-Local Ownership
Each source owns its own workspace:
octopus/<source_name>/
    state.db
    records/
    manifests/
    objects/
This keeps failures and rebuilds bounded by source. A slow CNInfo backfill should not corrupt or block Sina 7x24 records; a downloader retry queue belongs to the source that produced the object manifest.
Append-Only Raw Records
Raw records are written as JSONL envelopes.
Current writers use monthly partitions when possible:
records/month=YYYY-MM/detail.jsonl
The month comes from payload.create_time or published_at. If no month can be derived, the record goes under:
records/month=unknown/detail.jsonl
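The partition rule above can be sketched as follows. This is a minimal illustration, not the actual Octopus writer: it assumes timestamps are strings that begin with YYYY-MM, that payload.create_time is tried before a top-level published_at (the precedence is not stated here), and the helper names partition_path and append_record are hypothetical.

```python
import json
import re
from pathlib import Path

# Assumption: timestamps start with "YYYY-MM", e.g. "2024-03-15 09:30:00".
_MONTH_RE = re.compile(r"^(\d{4})-(\d{2})")


def partition_path(envelope: dict) -> str:
    """Return the relative record file for an envelope (hypothetical helper)."""
    payload = envelope.get("payload") or {}
    # Assumption: payload.create_time is tried first, then published_at.
    for value in (payload.get("create_time"), envelope.get("published_at")):
        match = _MONTH_RE.match(str(value)) if value else None
        if match and 1 <= int(match.group(2)) <= 12:
            return f"records/month={match.group(1)}-{match.group(2)}/detail.jsonl"
    # No usable timestamp: fall back to the "unknown" partition.
    return "records/month=unknown/detail.jsonl"


def append_record(workspace: Path, envelope: dict) -> Path:
    """Append one envelope as a single JSONL line; existing lines are never rewritten."""
    target = workspace / partition_path(envelope)
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(envelope, ensure_ascii=False) + "\n")
    return target
```

Opening the file in append mode keeps the write path append-only: earlier lines are left untouched even when the same partition is written many times.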
Readers must discover record files recursively under records/ instead of assuming one flat records/detail.jsonl.
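A reader that follows this rule might look like the sketch below. It assumes only what is stated above (JSONL files somewhere under records/); the function names iter_record_files and iter_records are hypothetical.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_record_files(workspace: Path) -> list[Path]:
    """Find every JSONL file under records/, at any depth, in a stable order."""
    return sorted((workspace / "records").rglob("*.jsonl"))


def iter_records(workspace: Path) -> Iterator[dict]:
    """Yield each JSONL envelope from every partition, one line at a time."""
    for path in iter_record_files(workspace):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                if line.strip():  # tolerate blank lines
                    yield json.loads(line)
```

Because discovery is recursive, the same reader handles monthly partitions, the month=unknown partition, and any older flat records/detail.jsonl without special cases.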
Object Manifests
Object acquisition is split into two steps:
- Source sync writes pending object intent to manifests/objects.jsonl.
- download-objects resolves pending objects into objects/sha256/... and appends manifests/objects-resolved.jsonl.
Failures are appended to:
manifests/objects-failed.jsonl
The JSONL manifests remain the audit trail.
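Because the manifests are the audit trail, the set of still-pending objects can be derived from them alone. The sketch below assumes each manifest line carries an "object_id" key; that key and the helper names are hypothetical, and the real manifest schema may differ.

```python
import json
from pathlib import Path


def read_jsonl(path: Path) -> list[dict]:
    """Read a JSONL manifest; a missing file counts as empty."""
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def pending_objects(workspace: Path) -> list[dict]:
    """Intents in objects.jsonl that have no entry in objects-resolved.jsonl."""
    manifests = workspace / "manifests"
    # "object_id" is a hypothetical key used only for this illustration.
    resolved = {
        row["object_id"] for row in read_jsonl(manifests / "objects-resolved.jsonl")
    }
    pending: dict[str, dict] = {}
    for row in read_jsonl(manifests / "objects.jsonl"):
        if row["object_id"] not in resolved:
            pending.setdefault(row["object_id"], row)  # keep the first intent per id
    return list(pending.values())
```

In this sketch, entries in manifests/objects-failed.jsonl do not remove an object from the pending set, so a failed download stays eligible for retry; whether Octopus treats failures that way is an assumption.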
Rebuildable State Database
state.db is source-local operational state, not the raw data contract.
It stores:
- source checkpoints
- run records
- request pacing slots
- CNInfo company catalog cache
- object download queue state
The object download queue is a source-local SQLite execution index. It avoids scanning large manifests for every download-objects --limit 50 run, but it can be rebuilt from JSONL manifests when needed.
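Rebuilding the queue from the manifests could look roughly like this. The table name object_queue, its columns, the status values, and the "object_id" manifest key are all hypothetical; this is a sketch of the idea, not the schema Octopus uses.

```python
import json
import sqlite3
from pathlib import Path


def _read_jsonl(path: Path) -> list[dict]:
    """Read a JSONL manifest; a missing file counts as empty."""
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def rebuild_object_queue(workspace: Path) -> int:
    """Recreate the queue table from the manifests; return the unresolved count."""
    manifests = workspace / "manifests"
    conn = sqlite3.connect(workspace / "state.db")
    try:
        conn.execute("DROP TABLE IF EXISTS object_queue")
        conn.execute(
            "CREATE TABLE object_queue (object_id TEXT PRIMARY KEY, status TEXT NOT NULL)"
        )
        # Later manifests override earlier ones, so a resolved object wins.
        for name, status in (
            ("objects.jsonl", "pending"),
            ("objects-failed.jsonl", "failed"),
            ("objects-resolved.jsonl", "resolved"),
        ):
            rows = [(row["object_id"], status) for row in _read_jsonl(manifests / name)]
            conn.executemany("INSERT OR REPLACE INTO object_queue VALUES (?, ?)", rows)
        conn.commit()
        query = "SELECT COUNT(*) FROM object_queue WHERE status != 'resolved'"
        return conn.execute(query).fetchone()[0]
    finally:
        conn.close()


def next_batch(workspace: Path, limit: int = 50) -> list[str]:
    """Fetch up to `limit` pending ids without scanning the manifests."""
    conn = sqlite3.connect(workspace / "state.db")
    try:
        rows = conn.execute(
            "SELECT object_id FROM object_queue WHERE status = 'pending' "
            "ORDER BY rowid LIMIT ?",
            (limit,),
        ).fetchall()
        return [row[0] for row in rows]
    finally:
        conn.close()
```

The indexed lookup in next_batch is what a limited download run would use; the manifests are only read again when the table has to be rebuilt.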
Scheduler Model
run-scheduler loads enabled sources from sources/*.toml, keeps scheduling state in memory, and starts source tasks as subprocess workers.
The scheduler design is intentionally single-machine:
- no Redis
- no distributed queue
- no Kubernetes
- no long-running worker pool beyond subprocesses
Each worker still writes only to one source workspace. Source-level locks prevent two workers from writing the same source concurrently.
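One common way to get a source-level lock on a single machine is an exclusively created lock file. The sketch below illustrates that technique; the .lock file name and the SourceLocked exception are hypothetical, and Octopus may implement its locks differently.

```python
import os
from contextlib import contextmanager
from pathlib import Path


class SourceLocked(RuntimeError):
    """Raised when another worker already holds the source lock."""


@contextmanager
def source_lock(workspace: Path):
    """Hold an exclusive lock file inside the workspace for the duration of a run."""
    lock_path = workspace / ".lock"  # hypothetical file name
    try:
        # O_EXCL makes creation fail if the file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise SourceLocked(f"{workspace.name} is already being written") from None
    try:
        os.write(fd, str(os.getpid()).encode())  # record the owner for debugging
        yield lock_path
    finally:
        os.close(fd)
        os.unlink(lock_path)
```

A worker would wrap its whole sync in `with source_lock(workspace):`. A lock file left behind by a crashed worker has to be cleaned up separately, which this sketch does not handle.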
Request Pacing
Sources define request_delay and request metadata in TOML. Local workers reserve request slots through source-local state before calling remote websites or APIs.
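Slot reservation can be pictured as a small SQLite transaction that hands out the next free send time. The table name request_slots, its columns, and the function names below are hypothetical; only request_delay comes from the source configuration described above.

```python
import sqlite3
import time


def open_state(path: str) -> sqlite3.Connection:
    """Open the state database with manual transaction control."""
    conn = sqlite3.connect(path, isolation_level=None)  # autocommit; we issue BEGIN
    conn.execute(
        "CREATE TABLE IF NOT EXISTS request_slots "
        "(name TEXT PRIMARY KEY, next_free REAL NOT NULL)"
    )
    return conn


def reserve_slot(conn, request_delay: float, now=None, name: str = "default") -> float:
    """Return the earliest time the caller may send its next request."""
    now = time.time() if now is None else now
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        row = conn.execute(
            "SELECT next_free FROM request_slots WHERE name = ?", (name,)
        ).fetchone()
        slot = max(now, row[0]) if row else now
        # The following slot opens request_delay seconds after this one.
        conn.execute(
            "INSERT OR REPLACE INTO request_slots VALUES (?, ?)",
            (name, slot + request_delay),
        )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return slot
```

A worker would then wait with `time.sleep(max(0.0, slot - time.time()))` before calling the remote site, so two workers sharing the same state never send closer together than request_delay.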
Slow acquisition is intentional. Octopus optimizes for resumability, low server pressure, and source respect over aggressive crawling.
Design Boundary
If a future feature needs search, ranking, cross-source joins, or user-facing serving, implement it in Lighthouse or another read-side projection.
If a future feature needs source crawling, raw append, checkpointing, or object download state, implement it in Octopus.