Shadow Lighthouse Raw Contract

Lighthouse reads the raw files that Octopus writes under this contract. It treats those files as the durable source of truth and writes its own read models separately.

Expected Octopus Source Layout

octopus/
  <source_name>/
    records/
      detail.jsonl
      month=YYYY-MM/
        detail.jsonl
    manifests/
      objects.jsonl
      objects-resolved.jsonl
      objects-failed.jsonl
    objects/
      sha256/
        <first-two-hex>/
          <next-two-hex>/
            <sha256>.<ext>

records/detail.jsonl and records/month=YYYY-MM/detail.jsonl are both valid. Lighthouse discovers raw records recursively under records/.
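
The recursive discovery described above can be sketched in Python. This is a minimal illustration, not Lighthouse's actual code: the helper name discover_record_files is hypothetical, and it assumes that raw record files are always named detail.jsonl, since that is the only name the layout shows.

```python
from pathlib import Path
from typing import List


def discover_record_files(source_dir: Path) -> List[Path]:
    """Return every detail.jsonl under <source_dir>/records/, at any depth.

    Hypothetical helper. Assumes raw record files are always named
    detail.jsonl; the contract only shows that name.
    """
    records_dir = source_dir / "records"
    if not records_dir.is_dir():
        return []
    # rglob matches records/detail.jsonl as well as
    # records/month=YYYY-MM/detail.jsonl; sorting keeps the order stable.
    return sorted(p for p in records_dir.rglob("detail.jsonl") if p.is_file())
```

Sorting the result is a design choice of this sketch so that repeated scans visit files in the same order; the contract itself does not specify a scan order.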

Raw Record Envelope

Each raw record line is one JSON object:

{
  "source_name": "cninfo_announcements",
  "record_type": "announcement_detail",
  "source_record_id": "announcement:1225267495",
  "published_at": "2026-05-14",
  "title": "平安银行:2025年年度报告",
  "detail_url": "https://example.com/detail",
  "issuer_names": ["平安银行"],
  "security_codes": ["000001"],
  "payload": {}
}

Important fields:

Field             Meaning
source_name       Source workspace name
record_type       Source-specific record category
source_record_id  Stable source-local document id
published_at      Document publish date or timestamp when available
title             Display and search title
issuer_names      Names used for issuer lookup
security_codes    Security codes used for issuer lookup
payload           Source-specific body and metadata

If the same source_record_id appears more than once, the catalog rebuild keeps the projection from the later occurrence of that id.
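
The duplicate rule can be illustrated with a short Python sketch. The function name latest_by_record_id is hypothetical, and the sketch assumes that "later" means later in the order the lines are read; the contract does not say how order is defined across multiple files.

```python
import json
from typing import Dict, Iterable


def latest_by_record_id(lines: Iterable[str]) -> Dict[str, dict]:
    """Keep one record per source_record_id, letting later lines win.

    Hypothetical helper; "later" is assumed to mean later in read order.
    """
    latest: Dict[str, dict] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        # Unconditional assignment overwrites any earlier record with this id.
        latest[record["source_record_id"]] = record
    return latest
```

Feeding it two lines with the same source_record_id returns a single entry holding the second line's fields.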

Object Manifests

objects.jsonl records desired or discovered objects, including pending downloads.

objects-resolved.jsonl records objects that have been downloaded and placed under objects/.

A resolved object entry looks like:

{
  "source_name": "cninfo_announcements",
  "source_record_id": "announcement:1225267495",
  "object_role": "primary_attachment",
  "sha256": "616f43b56f3638670cd19260190265439d6af3c226112b174e161a162544b13f",
  "size_bytes": 100,
  "mime_type": "application/pdf",
  "file_ext": "pdf",
  "storage_rel_path": "sha256/61/6f/616f43b56f3638670cd19260190265439d6af3c226112b174e161a162544b13f.pdf",
  "source_url": "https://static.cninfo.com.cn/finalpage/doc.pdf",
  "filename_hint": "annual-report.pdf",
  "fetched_at": "2026-05-20T02:49:43Z"
}

Lighthouse indexes objects only from objects-resolved.jsonl, and only entries whose sha256 is not pending.
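
As a rough sketch of this rule, the Python below filters manifest lines and rebuilds the content-addressed path from the source layout. Both function names are hypothetical. How a pending entry is encoded is not specified here, so the sketch assumes it carries either the literal string "pending" or no sha256 at all.

```python
import json
from typing import Iterable, Iterator


def expected_storage_rel_path(sha256: str, file_ext: str) -> str:
    """Build sha256/<first-two-hex>/<next-two-hex>/<sha256>.<ext>."""
    return f"sha256/{sha256[:2]}/{sha256[2:4]}/{sha256}.{file_ext}"


def indexable_objects(manifest_lines: Iterable[str]) -> Iterator[dict]:
    """Yield entries from objects-resolved.jsonl that have a real digest."""
    for line in manifest_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        entry = json.loads(line)
        sha = entry.get("sha256")
        # Assumption: pending entries use the literal "pending" or omit sha256.
        if not sha or sha == "pending":
            continue
        yield entry
```

For the resolved entry shown above, expected_storage_rel_path(entry["sha256"], entry["file_ext"]) reproduces its storage_rel_path, which gives a cheap consistency check before indexing.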

Lighthouse Outputs

Lighthouse writes read-side files under lighthouse/<source_name>/:

lighthouse/
  <source_name>/
    indexes/
      catalog.sqlite
      fts.sqlite
      tables.sqlite
      raw-ingest-state.json
    artifacts/
    canonical/

These files are rebuildable from Octopus raw data and derived extractors. They are not the raw source of truth.

Operational Rule

Do not make Lighthouse write into octopus/<source_name>/. Octopus owns raw records, manifests, object downloads, and source checkpoints. Lighthouse owns only the read-side projection.
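
One way to enforce this rule in code is a guard that every Lighthouse write goes through. The sketch below is an illustration only: the function name assert_lighthouse_write_path and the data_root parameter are hypothetical, and it assumes octopus/ and lighthouse/ sit side by side under a single root directory.

```python
from pathlib import Path


def assert_lighthouse_write_path(target: Path, data_root: Path) -> Path:
    """Return the resolved target if it lies under <data_root>/lighthouse/.

    Hypothetical guard. Assumes octopus/ and lighthouse/ are siblings
    under data_root. Raises PermissionError for any other location.
    """
    resolved = target.resolve()
    allowed = (data_root / "lighthouse").resolve()
    # Resolving first means ".." segments and symlinks cannot escape the tree.
    if allowed not in resolved.parents:
        raise PermissionError(f"refusing to write outside {allowed}: {resolved}")
    return resolved
```

Calling the guard before opening any file for writing turns an accidental write into octopus/ into an immediate error instead of silent corruption of the raw data.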