Shadow Lighthouse Deployment

Lighthouse is deployed as a single local read-side service. It reads shared lakehouse files and serves local SQLite indexes over HTTP.

Production Layout

Current server convention:

/dev/data1/shadow-lakehouse/
  octopus/
    <source_name>/      # raw contract written by shadow-octopus
  lighthouse/
    <source_name>/      # read-side indexes and artifacts

The production Lighthouse config should point at the read-side root:

data_root = "/dev/data1/shadow-lakehouse/lighthouse"

Rebuild Read-Side Indexes

From the shadow-lighthouse repository:

scripts/ingest_lakehouse_raw.sh --skip-text

This reads all source workspaces under:

/dev/data1/shadow-lakehouse/octopus

and writes read-side indexes under:

/dev/data1/shadow-lakehouse/lighthouse

Use a source filter when rebuilding one source:

scripts/ingest_lakehouse_raw.sh --skip-text --source cninfo_announcements
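When a full rebuild is too coarse, the same source filter can be applied one source at a time, so that a failure in one source does not stop the others. The following is a minimal sketch, not part of the repository: the function name rebuild_all_sources and the variables OCTOPUS_ROOT and INGEST_CMD are illustrative, and it assumes each directory name under the octopus root is a valid --source value.

```shell
#!/usr/bin/env bash
# Sketch: rebuild each source separately and report failures at the end.
# OCTOPUS_ROOT and INGEST_CMD are illustrative variables (assumptions).
OCTOPUS_ROOT="${OCTOPUS_ROOT:-/dev/data1/shadow-lakehouse/octopus}"
INGEST_CMD="${INGEST_CMD:-scripts/ingest_lakehouse_raw.sh}"

rebuild_all_sources() {
  local failed=0 dir name
  for dir in "$OCTOPUS_ROOT"/*/; do
    [ -d "$dir" ] || continue          # skip when the glob matches nothing
    name="$(basename "$dir")"          # assumed to equal the source name
    echo "rebuilding: $name"
    if ! "$INGEST_CMD" --skip-text --source "$name"; then
      echo "failed: $name" >&2
      failed=1
    fi
  done
  return "$failed"                     # non-zero if any source failed
}
```

Run rebuild_all_sources from the shadow-lighthouse repository root so the relative script path resolves.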

Incremental Ingest For Live Sources

For minute-level append-only sources, prefer incremental source ingest:

/dev/data1/shadow-lighthouse/.venv/bin/shadow-lighthouse \
  --config /dev/data1/shadow-lighthouse/config.prod.toml \
  ingest-raw-source \
  --source sina7x24 \
  --octopus-source-root /dev/data1/shadow-lakehouse/octopus/sina7x24 \
  --incremental \
  --max-records-per-run 20000 \
  --skip-text

This keeps CPU and memory bounded: Lighthouse reads only the raw record lines appended after the stored byte offset, and --max-records-per-run caps how many records a single run processes.
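The byte-offset idea can be illustrated outside Lighthouse with standard tools. This is only a sketch of the general technique, not Lighthouse's implementation: read_new_lines and the offset file are hypothetical, and it assumes the writer only ever appends complete lines.

```shell
#!/usr/bin/env bash
# Illustration only: print the bytes appended to a file since the last call.
# read_new_lines and the offset-file layout are hypothetical helpers.
read_new_lines() {
  local data_file="$1" offset_file="$2" offset=0 size
  [ -f "$offset_file" ] && offset="$(cat "$offset_file")"
  size="$(wc -c < "$data_file" | tr -d ' ')"
  if [ "$size" -gt "$offset" ]; then
    # tail -c +N starts at byte N (1-based), so this skips the first $offset
    # bytes; head -c stops at the size measured above, even if the file grows.
    tail -c "+$((offset + 1))" "$data_file" | head -c "$((size - offset))"
    echo "$size" > "$offset_file"      # remember where this run stopped
  fi
}
```

Each call does work proportional to the appended bytes rather than to the whole file, which is why this pattern suits high-frequency append-only sources.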

Serve

Run the service:

/dev/data1/shadow-lighthouse/.venv/bin/shadow-lighthouse \
  --config /dev/data1/shadow-lighthouse/config.prod.toml \
  serve \
  --host 127.0.0.1 \
  --port 8766

In production the service runs as a systemd user unit. Check its status with:

systemctl --user status shadow-lighthouse.service --no-pager
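After a rebuild or a config change, it can help to restart the unit and print recent logs only if it fails to come up. A minimal sketch, assuming the unit name shown above; restart_and_check is a hypothetical helper, not something shipped with the repository.

```shell
#!/usr/bin/env bash
# Sketch: restart the user unit and show recent logs only on failure.
# restart_and_check is a hypothetical helper name.
restart_and_check() {
  local unit="${1:-shadow-lighthouse.service}"
  systemctl --user restart "$unit" || return 1
  # is-active --quiet exits non-zero when the unit is not running
  systemctl --user is-active --quiet "$unit" || {
    journalctl --user -u "$unit" -n 50 --no-pager
    return 1
  }
}
```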

Smoke Tests

Check process health:

curl "http://127.0.0.1:8766/health"

Check source discovery:

curl "http://127.0.0.1:8766/sources"

Check CNInfo document serving:

curl "http://127.0.0.1:8766/documents?source=cninfo_announcements&limit=24"

Check source status from the CLI:

/dev/data1/shadow-lighthouse/.venv/bin/shadow-lighthouse \
  --config /dev/data1/shadow-lighthouse/config.prod.toml \
  source-status \
  --source cninfo_announcements
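The HTTP checks above can be combined into one pass that stops at the first failure. This is a sketch under stated assumptions: smoke_test is a hypothetical helper, the 10-second timeout is an arbitrary choice, and it only checks that each endpoint returns a successful HTTP status, not that the response body is correct.

```shell
#!/usr/bin/env bash
# Sketch: run the HTTP smoke checks in order and stop at the first failure.
# smoke_test is a hypothetical helper; the endpoints are the ones listed above.
smoke_test() {
  local base="${1:-http://127.0.0.1:8766}" path
  for path in \
    "/health" \
    "/sources" \
    "/documents?source=cninfo_announcements&limit=24"; do
    # -f: fail on HTTP error status; -sS: quiet but still print errors;
    # --max-time: do not hang forever on an unresponsive service
    if curl -fsS --max-time 10 -o /dev/null "$base$path"; then
      echo "ok   $path"
    else
      echo "FAIL $path" >&2
      return 1
    fi
  done
}
```

Calling smoke_test with no argument targets the local service on port 8766; its exit status makes it usable from a deploy script.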

Operational Notes

  • Octopus and Lighthouse should remain separate: Octopus writes raw data; Lighthouse builds read-side projections.
  • Prefer source-local ingest and indexes before adding global state.
  • Use --skip-text for lightweight catalog updates when PDF or HTML text extraction is not needed.
  • Use incremental ingest for high-frequency append-only sources.
  • Rebuild source-local indexes when a raw contract migration changes the historical file layout.