Concept

Semantic structure is a separate layer.

get_semantic_document_structure() writes semantic-structure.json alongside the page index. It adds heading and section structure plus cross-page merged tables, formulas, and figures without changing the shape of pages[].

Combined per-page extraction

One vision call per page extracts heading candidates, tables (LaTeX tabular), formulas (LaTeX math), and figures in a single prompt.

Cross-page aggregation

Headings are assembled into a section tree. Tables, formulas, and figures that span a page break are merged into single elements; per-page truncation flags mark which fragments continue across a boundary.
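The merge step can be sketched as follows. This is a minimal illustration, not the actual implementation: the `Fragment` and `Merged` types and the flag names `truncated_at_end` and `continues_from_previous` are hypothetical, and the sketch assumes a continuation is the first fragment of its kind on the immediately following page.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Fragment:
    kind: str                              # "table", "formula", or "figure"
    page: int
    content: str
    continues_from_previous: bool = False  # hypothetical flag name
    truncated_at_end: bool = False         # hypothetical flag name


@dataclass
class Merged:
    kind: str
    start_page: int
    end_page: int
    content: str


def merge_fragments(fragments: List[Fragment]) -> List[Merged]:
    """Join per-page fragments that the truncation flags mark as one element."""
    merged: List[Merged] = []
    open_by_kind: Dict[str, Merged] = {}  # elements still awaiting a continuation

    # sorted() is stable, so the order of fragments within a page is preserved.
    for frag in sorted(fragments, key=lambda f: f.page):
        pending = open_by_kind.get(frag.kind)
        if (
            pending is not None
            and frag.continues_from_previous
            and frag.page == pending.end_page + 1
        ):
            # Extend the element that was cut off on the previous page.
            pending.content += "\n" + frag.content
            pending.end_page = frag.page
        else:
            pending = Merged(frag.kind, frag.page, frag.page, frag.content)
            merged.append(pending)

        if frag.truncated_at_end:
            open_by_kind[frag.kind] = pending
        else:
            open_by_kind.pop(frag.kind, None)

    return merged
```

A table flagged as truncated on page 1, followed by a continuing table on page 2, would come out as one element covering pages 1 to 2.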

Output shape

root contains the heading tree. Optional top-level tables[], formulas[], and figures[] carry cross-page merged elements with startPage/endPage.
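To make the shape concrete, here is a hedged sketch that nests heading candidates into a tree and places it under `root` next to a merged `tables[]` entry. Only `root`, `tables`, `formulas`, `figures`, `startPage`, and `endPage` come from the description above; the node fields (`title`, `level`, `page`, `children`) and the `latex` field are assumptions made for illustration.

```python
from typing import Any, Dict, List


def build_section_tree(headings: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Nest headings by level using a stack of currently open ancestors."""
    root: Dict[str, Any] = {"title": None, "level": 0, "children": []}
    stack = [root]
    for h in headings:
        node = {
            "title": h["title"],
            "level": h["level"],  # assumed to be >= 1
            "page": h["page"],
            "children": [],
        }
        # Close every open section at the same or a deeper level.
        while stack[-1]["level"] >= node["level"]:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root


headings = [
    {"title": "Overview", "level": 1, "page": 1},
    {"title": "Pinout", "level": 2, "page": 1},
    {"title": "Ratings", "level": 1, "page": 3},
]

document = {
    "root": build_section_tree(headings),
    # Optional top-level arrays; each merged element carries its page span.
    "tables": [{"startPage": 3, "endPage": 4, "latex": "\\begin{tabular}{ll}\\end{tabular}"}],
}
```

Because the merged elements live in their own top-level arrays, a consumer that only needs the section tree can ignore them.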

LLM requirement

Semantic extraction requires an explicitly configured local provider and model.

detector = agent-structured-v1

Downstream rule

Read the `detector` and `strategyKey` metadata before assuming semantic richness or cache reuse.
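A downstream consumer might apply this rule along the following lines. The sketch assumes `detector` and `strategyKey` sit at the top level of semantic-structure.json, which the text above does not confirm; the function name `load_reusable_structure` is hypothetical.

```python
import json
from pathlib import Path
from typing import Optional


def load_reusable_structure(
    path: str, expected_detector: str, expected_strategy_key: str
) -> Optional[dict]:
    """Return the cached artifact only if its metadata matches expectations."""
    artifact = Path(path)
    if not artifact.exists():
        return None

    data = json.loads(artifact.read_text(encoding="utf-8"))

    # Assumption: both fields are top-level keys of the artifact.
    if data.get("detector") != expected_detector:
        return None  # produced by a different extraction path
    if data.get("strategyKey") != expected_strategy_key:
        return None  # provider, model, or budget changed since it was written
    return data
```

Returning `None` rather than raising lets the caller treat a mismatch the same way as a missing artifact and rerun extraction.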

Not domain logic

The semantic layer is general document structure. Domain-specific interpretation belongs downstream.

Metadata that matters

| Field | Why it exists |
| --- | --- |
| `detector` | identifies which semantic extraction path produced the artifact |
| `strategyKey` | changes when provider, model, or extraction budget changes enough to invalidate reuse |
| `pageIndexArtifactPath` | links the semantic layer back to the stable page index |
| `pageArtifactPath` | lets section nodes point back to the originating page artifact |
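One way a key with the `strategyKey` behavior could be derived is by hashing the settings that affect extraction. The source does not say how the real key is computed, so the function below, its parameters, and the `maxPages` budget field are purely illustrative.

```python
import hashlib
import json


def strategy_key(provider: str, model: str, max_pages: int) -> str:
    """Hypothetical derivation: a stable hash of the extraction settings."""
    payload = json.dumps(
        {"provider": provider, "model": model, "maxPages": max_pages},
        sort_keys=True,          # canonical key order keeps the hash stable
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
```

Any change to provider, model, or budget yields a different key, so an artifact written under the old settings is no longer treated as reusable.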

Still not domain logic

The semantic layer is general document structure. It should not encode datasheet-specific, EDA-specific, or other downstream product semantics.