Shadow Octopus

shadow-octopus is the write-side collector in the Shadow data stack.

It acquires data from configured sources, writes source-local raw records and object manifests, downloads raw objects, and records operational state in a local SQLite database.

It does not provide a public API. Query, search, document lookup, and serving belong to shadow-lighthouse.

What It Owns

Octopus owns acquisition and persistence:

  • source adapters
  • polite crawling and request pacing
  • source checkpoints
  • raw record append
  • object manifest append
  • object download queue and retry state
  • source-local run records
  • scheduler task execution

Octopus does not own:

  • full-text search
  • canonical read models
  • document serving
  • user-facing HTTP APIs
  • cross-source query semantics
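
To make the "object download queue and retry state" item concrete, here is a minimal sketch of how such state could be kept in a local SQLite database. The table name object_queue, its columns, the status values, and MAX_ATTEMPTS are all hypothetical; the real state.db schema is not described here and may look quite different.

```python
import sqlite3

# Hypothetical schema: one row per object URL waiting to be downloaded.
SCHEMA = """
CREATE TABLE IF NOT EXISTS object_queue (
    url        TEXT PRIMARY KEY,
    status     TEXT NOT NULL DEFAULT 'pending',  -- pending | done | failed
    attempts   INTEGER NOT NULL DEFAULT 0,
    last_error TEXT
)
"""

MAX_ATTEMPTS = 3  # hypothetical retry budget, not taken from the project


def open_state(path: str) -> sqlite3.Connection:
    """Open (or create) the state database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def enqueue(conn: sqlite3.Connection, url: str) -> None:
    """Add a URL to the queue; re-enqueueing a known URL is a no-op."""
    with conn:
        conn.execute("INSERT OR IGNORE INTO object_queue (url) VALUES (?)", (url,))


def record_failure(conn: sqlite3.Connection, url: str, error: str) -> None:
    """Count a failed attempt and mark the row failed once the budget is spent."""
    with conn:
        conn.execute(
            """
            UPDATE object_queue
               SET attempts = attempts + 1,
                   last_error = ?,
                   status = CASE WHEN attempts + 1 >= ? THEN 'failed'
                                 ELSE 'pending' END
             WHERE url = ?
            """,
            (error, MAX_ATTEMPTS, url),
        )


def record_success(conn: sqlite3.Connection, url: str) -> None:
    """Mark a URL as downloaded."""
    with conn:
        conn.execute("UPDATE object_queue SET status = 'done' WHERE url = ?", (url,))


def pending(conn: sqlite3.Connection) -> list:
    """Return the URLs that still need a download attempt."""
    rows = conn.execute(
        "SELECT url FROM object_queue WHERE status = 'pending' ORDER BY url"
    )
    return [row[0] for row in rows]
```

Keeping the attempt counter in the database rather than in memory means an interrupted run can resume without losing track of which objects have already been retried.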

Data Flow

source website/API/feed
  -> shadow-octopus source adapter
  -> octopus/<source>/records/month=YYYY-MM/detail.jsonl
  -> octopus/<source>/manifests/objects.jsonl
  -> octopus/<source>/objects/sha256/...
  -> shadow-lighthouse ingest
  -> lighthouse/<source>/indexes/*.sqlite

Octopus stops at the raw contract. Lighthouse reads that contract and builds the read side.
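
The record-append step of this flow can be sketched as follows. This is an illustration only, not the collector's actual code: the function names are hypothetical, and it assumes the month partition is chosen from a timestamp supplied by the caller (whether Octopus partitions by fetch time or by some source-provided date is not stated here).

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def records_path(data_root: Path, source: str, when: datetime) -> Path:
    """Build <data_root>/<source>/records/month=YYYY-MM/detail.jsonl."""
    return data_root / source / "records" / f"month={when:%Y-%m}" / "detail.jsonl"


def append_record(
    data_root: Path,
    source: str,
    record: dict,
    when: Optional[datetime] = None,
) -> Path:
    """Append one record as a single JSON line; earlier lines are never rewritten."""
    # Assumption: default to the current UTC time when no timestamp is given.
    when = when or datetime.now(timezone.utc)
    path = records_path(data_root, source, when)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False, sort_keys=True) + "\n")
    return path
```

Because the file is only ever opened in append mode, a downstream reader such as Lighthouse can treat each line as an immutable raw record.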

Local Workspace

data/octopus/
  <source_name>/
    state.db
    runs/
    records/
      month=YYYY-MM/
        detail.jsonl
    manifests/
      objects.jsonl
      objects-resolved.jsonl
      objects-failed.jsonl
    objects/
      sha256/
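
For orientation, the directory skeleton above could be prepared with a few lines of Python. This is a sketch under assumptions: ensure_workspace is a hypothetical helper, and it only creates the directories named in the layout. Files such as state.db and the manifest *.jsonl files are left for the collector to create, and nothing is assumed about the structure beneath objects/sha256/.

```python
from pathlib import Path

# Directory names copied from the layout above.
WORKSPACE_DIRS = ("runs", "records", "manifests", "objects/sha256")


def ensure_workspace(data_root: Path, source_name: str) -> Path:
    """Create the per-source directories if they are missing and return the source root."""
    root = data_root / source_name
    for rel in WORKSPACE_DIRS:
        (root / rel).mkdir(parents=True, exist_ok=True)
    return root
```

Calling ensure_workspace(Path("data/octopus"), "example_source") twice is harmless, since existing directories are left untouched.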

Source definitions live under:

sources/*.toml

config.example.toml only defines global defaults and the source directory:

data_root = "data/octopus"
sources_dir = "sources"