Shadow Octopus
shadow-octopus is the write-side collector in the Shadow data stack.
It acquires data from configured sources, writes source-local raw records and object manifests, downloads raw objects, and records operational state in a local SQLite database.
It does not provide a public API. Query, search, document lookup, and serving belong to shadow-lighthouse.
What It Owns
Octopus owns acquisition and persistence:
- source adapters
- polite crawling and request pacing
- source checkpoints
- raw record append
- object manifest append
- object download queue and retry state
- source-local run records
- scheduler task execution
Octopus does not own:
- full-text search
- canonical read models
- document serving
- user-facing HTTP APIs
- cross-source query semantics
Data Flow
source website/API/feed
  -> shadow-octopus source adapter
  -> octopus/<source>/records/month=YYYY-MM/detail.jsonl
  -> octopus/<source>/manifests/objects.jsonl
  -> octopus/<source>/objects/sha256/...
  -> shadow-lighthouse ingest
  -> lighthouse/<source>/indexes/*.sqlite
Octopus stops at the raw contract. Lighthouse reads that contract and builds the read side.
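The record-writing step of this flow can be sketched in Python as a month-partitioned JSONL append. This is a minimal illustration, not shadow-octopus's actual code: the helper name `append_raw_record`, the record fields, and the choice to partition by fetch time are all assumptions. Only the `records/month=YYYY-MM/detail.jsonl` path shape comes from the flow above.

```python
import json
from datetime import datetime
from pathlib import Path


def append_raw_record(source_dir: Path, record: dict, fetched_at: datetime) -> Path:
    """Append one record as a JSON line to the month partition for fetched_at.

    Hypothetical helper: the partition key (fetch time) and the record
    schema are assumptions, not part of the documented contract.
    """
    partition = source_dir / "records" / f"month={fetched_at:%Y-%m}"
    partition.mkdir(parents=True, exist_ok=True)
    path = partition / "detail.jsonl"
    # Append mode leaves earlier records untouched; one JSON object per line.
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path
```

Because the file is only ever appended to, a reader such as Lighthouse could in principle consume it line by line without coordinating with the writer, though how the real ingest works is not described here.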
Local Workspace
data/octopus/
  <source_name>/
    state.db
    runs/
    records/
      month=YYYY-MM/
        detail.jsonl
    manifests/
      objects.jsonl
      objects-resolved.jsonl
      objects-failed.jsonl
    objects/
      sha256/
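For scripts that need to locate files in this workspace, the layout can be captured in a small path helper. The sketch below is an assumption-laden illustration: the class `SourceWorkspace` and its method names are hypothetical and not part of shadow-octopus. It only mirrors the directory and file names listed above, and it does not create `state.db` or guess at the structure beneath `sha256/`.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class SourceWorkspace:
    """Paths for one source under the data root (hypothetical helper)."""

    data_root: Path
    source_name: str

    @property
    def root(self) -> Path:
        return self.data_root / self.source_name

    @property
    def state_db(self) -> Path:
        return self.root / "state.db"

    @property
    def object_manifest(self) -> Path:
        return self.root / "manifests" / "objects.jsonl"

    def ensure_dirs(self) -> None:
        """Create the documented directories if they do not exist yet."""
        for sub in ("runs", "records", "manifests", "objects/sha256"):
            (self.root / sub).mkdir(parents=True, exist_ok=True)
```

A caller would construct it as `SourceWorkspace(Path("data/octopus"), "example_source")`, where `example_source` is a placeholder name.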
Source definitions live under:
sources/*.toml
config.example.toml defines only the global defaults and the source directory:
data_root = "data/octopus"
sources_dir = "sources"