Shadow Octopus Stable Design

This page records the parts of Octopus that should stay stable unless there is a deliberate data-contract migration.

No Public API

Octopus is not a serving layer.

It exposes CLI commands and scheduler workers for acquisition, but it does not expose a public HTTP API or SDK query surface. Consumers should read through shadow-lighthouse, not directly from Octopus process memory.

The durable interface is the filesystem raw contract under:

octopus/<source_name>/

Source-Local Ownership

Each source owns its own workspace:

octopus/<source_name>/
  state.db
  records/
  manifests/
  objects/

This keeps failures and rebuilds bounded by source. A slow CNInfo backfill should not corrupt or block Sina 7x24 records; a downloader retry queue belongs to the source that produced the object manifest.

Append-Only Raw Records

Raw records are written as JSONL envelopes.

Current writers use monthly partitions when possible:

records/month=YYYY-MM/detail.jsonl

The month comes from payload.create_time or published_at. If no month can be derived, the record goes under:

records/month=unknown/detail.jsonl

Readers must discover record files recursively under records/ instead of assuming one flat records/detail.jsonl.
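
The partition rule and the recursive discovery described above can be sketched as follows. This is a minimal illustration, not the actual Octopus writer: the helper names `month_partition` and `discover_record_files` are hypothetical, and the sketch assumes that `create_time` lives under `payload`, that `published_at` sits on the envelope itself, that `create_time` is tried first, and that both are ISO 8601 strings.

```python
from datetime import datetime
from pathlib import Path


def month_partition(record: dict) -> str:
    """Return the partition directory name for one raw record envelope."""
    payload = record.get("payload") or {}
    # Assumption: payload.create_time is tried before published_at, and both
    # are ISO 8601 strings. The real field order and formats may differ.
    for value in (payload.get("create_time"), record.get("published_at")):
        if not isinstance(value, str):
            continue
        try:
            parsed = datetime.fromisoformat(value)
        except ValueError:
            continue  # unparseable value, try the next candidate
        return f"month={parsed:%Y-%m}"
    # No usable timestamp: fall back to the unknown partition.
    return "month=unknown"


def discover_record_files(workspace: Path) -> list[Path]:
    """Find every JSONL record file under records/, at any depth."""
    return sorted((workspace / "records").rglob("*.jsonl"))
```

A reader built this way keeps working if a source still has a flat records/detail.jsonl next to newer monthly partitions, because `rglob` matches both.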

Object Manifests

Object acquisition is split into two steps:

  1. Source sync writes pending object intent to manifests/objects.jsonl.
  2. download-objects resolves pending objects into objects/sha256/... and appends manifests/objects-resolved.jsonl.

Failures are appended to:

manifests/objects-failed.jsonl

The JSONL manifests remain the audit trail.
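
Because the manifests are the audit trail, the set of still-pending objects can be derived from them alone. The sketch below shows one way to do that. The manifest schema is not specified on this page, so the `object_id` key and the helper names are assumptions made for illustration only.

```python
import json
from pathlib import Path


def read_jsonl(path: Path) -> list[dict]:
    """Load a JSONL file, treating a missing file as empty."""
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def pending_objects(manifests: Path, key: str = "object_id") -> list[dict]:
    """Return intents from objects.jsonl with no entry in objects-resolved.jsonl.

    Assumption: every manifest row carries a stable identifier under `key`.
    """
    resolved = {row[key] for row in read_jsonl(manifests / "objects-resolved.jsonl")}
    pending: dict[str, dict] = {}
    for row in read_jsonl(manifests / "objects.jsonl"):
        if row[key] not in resolved:
            # Keep the first intent per object; later duplicates are ignored.
            pending.setdefault(row[key], row)
    return list(pending.values())
```

Entries in manifests/objects-failed.jsonl are deliberately not excluded here, since this page does not describe a retry policy; a failed object simply stays pending in this sketch.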

Rebuildable State Database

state.db is source-local operational state, not the raw data contract.

It stores:

  • source checkpoints
  • run records
  • request pacing slots
  • CNInfo company catalog cache
  • object download queue state

The object download queue is a source-local SQLite execution index. It avoids scanning large manifests on every download-objects run (for example, download-objects --limit 50), and it can be rebuilt from the JSONL manifests when needed.
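
To illustrate why the queue counts as rebuildable, here is a rough sketch that recreates a queue table purely from the three manifests. The table name `object_queue`, its columns, the `object_id` key, and the status values are all hypothetical; the real state.db schema is not documented on this page.

```python
import json
import sqlite3
from pathlib import Path


def iter_jsonl(path: Path):
    """Yield parsed rows from a JSONL file; a missing file yields nothing."""
    if path.exists():
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    yield json.loads(line)


def rebuild_queue(workspace: Path, conn: sqlite3.Connection) -> int:
    """Recreate the (hypothetical) object_queue table from the manifests.

    Returns the number of objects that are not yet resolved.
    """
    manifests = workspace / "manifests"
    # Simplification: files are replayed in a fixed order, so a resolved
    # entry always wins over a failed one regardless of event time.
    sources = (
        ("objects.jsonl", "pending"),
        ("objects-failed.jsonl", "failed"),
        ("objects-resolved.jsonl", "resolved"),
    )
    with conn:  # commits on success, rolls back on error
        conn.execute("DROP TABLE IF EXISTS object_queue")
        conn.execute(
            "CREATE TABLE object_queue ("
            "object_id TEXT PRIMARY KEY, status TEXT NOT NULL)"
        )
        for name, status in sources:
            conn.executemany(
                "INSERT OR REPLACE INTO object_queue (object_id, status) VALUES (?, ?)",
                ((row["object_id"], status) for row in iter_jsonl(manifests / name)),
            )
    query = "SELECT COUNT(*) FROM object_queue WHERE status != 'resolved'"
    return conn.execute(query).fetchone()[0]
```

Dropping and recreating the table keeps the sketch short. It also shows the design point: nothing in the queue is authoritative, so losing it costs only a manifest scan.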

Scheduler Model

run-scheduler loads enabled sources from sources/*.toml, keeps scheduling state in memory, and starts source tasks as subprocess workers.

The scheduler design is intentionally single-machine:

  • no Redis
  • no distributed queue
  • no Kubernetes
  • no long-running worker pool beyond subprocesses

Each worker still writes only to one source workspace. Source-level locks prevent two workers from writing the same source concurrently.
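
This page does not say how source-level locks are implemented. One simple single-machine approach, shown here only as an assumption-laden sketch, is an exclusive lock file inside the source workspace. The `.lock` file name, the `source_lock` helper, and the `SourceBusy` exception are hypothetical.

```python
import os
from contextlib import contextmanager
from pathlib import Path


class SourceBusy(RuntimeError):
    """Raised when another worker already holds the source lock."""


@contextmanager
def source_lock(workspace: Path):
    """Hold an exclusive lock file for one source workspace."""
    lock_path = workspace / ".lock"  # hypothetical file name
    try:
        # O_CREAT | O_EXCL fails atomically if the file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise SourceBusy(f"{workspace.name} is already being written") from None
    try:
        # Record the owner's PID to help diagnose stale locks by hand.
        os.write(fd, str(os.getpid()).encode())
        yield lock_path
    finally:
        os.close(fd)
        os.unlink(lock_path)
```

A worker would wrap its whole sync in `with source_lock(workspace): ...`. One known weakness of this variant is that a crashed worker leaves the lock file behind, so a real implementation would need stale-lock handling or an OS-level advisory lock instead.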

Request Pacing

Sources define request_delay and request metadata in TOML. Local workers reserve request slots through source-local state before calling remote websites or APIs.
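
Slot reservation through source-local state could look roughly like the sketch below. It is an illustration under assumptions rather than the Octopus implementation: the `pacing_slot` table and the `reserve_slot` helper are hypothetical, `request_delay` is assumed to be in seconds, and the connection is assumed to be opened with `isolation_level=None` so the transaction can be managed explicitly.

```python
import sqlite3
import time
from typing import Optional


def reserve_slot(
    conn: sqlite3.Connection,
    request_delay: float,
    now: Optional[float] = None,
) -> float:
    """Reserve the next request slot and return the time it may be used.

    Assumes `conn` was opened with isolation_level=None (autocommit mode).
    """
    now = time.time() if now is None else now
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pacing_slot ("
        "id INTEGER PRIMARY KEY CHECK (id = 1), next_free REAL NOT NULL)"
    )
    # Take the write lock before reading so two workers cannot both
    # observe the same next_free value.
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute("SELECT next_free FROM pacing_slot WHERE id = 1").fetchone()
        slot = max(now, row[0]) if row else now
        conn.execute(
            "INSERT OR REPLACE INTO pacing_slot (id, next_free) VALUES (1, ?)",
            (slot + request_delay,),
        )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return slot
```

A caller would sleep until the returned time before sending its request, for example `time.sleep(max(0.0, reserve_slot(conn, 2.0) - time.time()))`. Because the reservation is stored in the database rather than in process memory, pacing survives worker restarts.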

Slow acquisition is intentional. Octopus optimizes for resumability, low server pressure, and respect for the source over aggressive crawling.

Design Boundary

If a future feature needs search, ranking, cross-source joins, or user-facing serving, implement it in shadow-lighthouse or another read-side projection.

If a future feature needs source crawling, raw append, checkpointing, or object download state, implement it in Octopus.