Shadow Lighthouse Raw Contract
Lighthouse reads the files defined by the Octopus raw contract. It treats them as the durable source of truth and writes its own read models separately.
Expected Octopus Source Layout
octopus/
  <source_name>/
    records/
      detail.jsonl
      month=YYYY-MM/
        detail.jsonl
    manifests/
      objects.jsonl
      objects-resolved.jsonl
      objects-failed.jsonl
    objects/
      sha256/
        <first-two-hex>/
          <next-two-hex>/
            <sha256>.<ext>
records/detail.jsonl and records/month=YYYY-MM/detail.jsonl are both valid. Lighthouse discovers raw records recursively under records/.
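The recursive discovery described above could look roughly like the sketch below. It is illustrative only: the function name `discover_record_files` is hypothetical, and it assumes that every raw record file is named `detail.jsonl`, which the layout suggests but does not guarantee.

```python
from pathlib import Path
from typing import Iterator


def discover_record_files(source_root: Path) -> Iterator[Path]:
    """Yield every detail.jsonl under <source_root>/records/, at any depth.

    Assumption: raw record files are always named detail.jsonl.
    """
    records_dir = source_root / "records"
    # rglob also matches files directly inside records_dir, so both
    # records/detail.jsonl and records/month=YYYY-MM/detail.jsonl are found.
    yield from sorted(records_dir.rglob("detail.jsonl"))
```

Sorting the paths keeps the read order deterministic between runs, which matters if a later occurrence of a record is meant to win over an earlier one.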
Raw Record Envelope
Each raw record line is one JSON object:
{
  "source_name": "cninfo_announcements",
  "record_type": "announcement_detail",
  "source_record_id": "announcement:1225267495",
  "published_at": "2026-05-14",
  "title": "平安银行:2025年年度报告",
  "detail_url": "https://example.com/detail",
  "issuer_names": ["平安银行"],
  "security_codes": ["000001"],
  "payload": {}
}
Important fields:
| Field | Meaning |
|---|---|
| source_name | Source workspace name |
| record_type | Source-specific record category |
| source_record_id | Stable source-local document id |
| published_at | Document publish date or timestamp when available |
| title | Display and search title |
| issuer_names | Names used for issuer lookup |
| security_codes | Security codes used for issuer lookup |
| payload | Source-specific body and metadata |
If the same source_record_id appears more than once, the catalog rebuild keeps only the later projection for that id.
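A minimal sketch of this "later occurrence wins" rule follows. The helper name `latest_by_record_id` is hypothetical, and the sketch assumes that "later" means later in the order the lines are read; the actual catalog rebuild may order records differently.

```python
import json
from typing import Iterable


def latest_by_record_id(lines: Iterable[str]) -> dict[str, dict]:
    """Parse JSONL lines and keep the last record seen per source_record_id.

    Assumption: "later" means later in read order.
    """
    latest: dict[str, dict] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        # A repeated id overwrites the earlier entry.
        latest[record["source_record_id"]] = record
    return latest
```

Keying a dictionary on source_record_id makes the overwrite explicit and keeps the rebuild idempotent: replaying the same files in the same order yields the same result.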
Object Manifests
objects.jsonl records desired or discovered objects, including pending downloads.
objects-resolved.jsonl records objects that have been downloaded and placed under objects/.
A resolved object entry looks like:
{
  "source_name": "cninfo_announcements",
  "source_record_id": "announcement:1225267495",
  "object_role": "primary_attachment",
  "sha256": "616f43b56f3638670cd19260190265439d6af3c226112b174e161a162544b13f",
  "size_bytes": 100,
  "mime_type": "application/pdf",
  "file_ext": "pdf",
  "storage_rel_path": "sha256/61/6f/616f43b56f3638670cd19260190265439d6af3c226112b174e161a162544b13f.pdf",
  "source_url": "https://static.cninfo.com.cn/finalpage/doc.pdf",
  "filename_hint": "annual-report.pdf",
  "fetched_at": "2026-05-20T02:49:43Z"
}
Lighthouse indexes an object from objects-resolved.jsonl only when its sha256 is not pending.
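The sketch below illustrates how the content-addressed path and the pending filter might fit together. Both function names are hypothetical. How a pending digest is actually encoded is not specified here, so the sketch assumes it is either the literal string "pending" or a missing or empty value.

```python
import json
from typing import Iterable, Iterator


def expected_storage_rel_path(sha256: str, file_ext: str) -> str:
    """Build sha256/<first-two-hex>/<next-two-hex>/<sha256>.<ext>."""
    return f"sha256/{sha256[:2]}/{sha256[2:4]}/{sha256}.{file_ext}"


def indexable_objects(lines: Iterable[str]) -> Iterator[dict]:
    """Yield resolved manifest entries whose sha256 is not pending."""
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        entry = json.loads(line)
        # Assumption: an unresolved digest is "pending", empty, or absent.
        if entry.get("sha256") in (None, "", "pending"):
            continue
        yield entry
```

For the resolved entry shown above, `expected_storage_rel_path(entry["sha256"], entry["file_ext"])` reproduces its storage_rel_path, which gives a cheap consistency check before indexing.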
Lighthouse Outputs
Lighthouse writes read-side files under lighthouse/<source_name>/:
lighthouse/
  <source_name>/
    indexes/
      catalog.sqlite
      fts.sqlite
      tables.sqlite
      raw-ingest-state.json
    artifacts/
    canonical/
These files are rebuildable from Octopus raw data and derived extractors. They are not the raw source of truth.
Operational Rule
Do not make Lighthouse write into octopus/<source_name>/. Octopus owns raw records, manifests, object downloads, and source checkpoints. Lighthouse owns the read-side projection only.
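One way to enforce this rule in code is a small guard that every Lighthouse write path passes through. The sketch below is an assumption about how such a check could be written, not a description of existing Lighthouse behavior; `ensure_outside_octopus` is a hypothetical name.

```python
from pathlib import Path


def ensure_outside_octopus(target: Path, octopus_root: Path) -> Path:
    """Return the resolved target, or raise if it lies inside octopus_root."""
    resolved = target.resolve()
    # Resolve both sides so ".." segments and symlinks cannot bypass the check.
    if resolved.is_relative_to(octopus_root.resolve()):
        raise PermissionError(
            f"refusing to write into Octopus-owned path: {resolved}"
        )
    return resolved
```

Calling the guard before a file is opened for writing turns an ownership violation into an immediate error instead of silent corruption of raw data. `Path.is_relative_to` requires Python 3.9 or newer.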