echo-pdf docs
Concept

Semantic structure is a separate layer.

get_semantic_document_structure() writes semantic-structure.json alongside the page index. It adds heading and section structure plus cross-page merged tables, formulas, and figures without changing the shape of pages[].

Combined per-page extraction

One vision call per page extracts heading candidates, tables (LaTeX tabular), formulas (LaTeX math), and figures in a single prompt.

Cross-page aggregation

Headings are assembled into a section tree. Tables, formulas, and figures are merged across page boundaries using truncation flags.
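The merge step can be pictured with a minimal sketch. It assumes each per-page fragment carries two truncation flags. The names PageFragment, truncatedAtStart, truncatedAtEnd, latex, and mergeAcrossPages are hypothetical and chosen only for illustration; they are not the echo-pdf schema. Only startPage and endPage come from the documented output.

```typescript
// Hypothetical per-page fragment. Field names are assumptions for illustration.
interface PageFragment {
  kind: "table" | "formula" | "figure";
  page: number;              // 1-based page number
  latex: string;             // content extracted from this page
  truncatedAtStart: boolean; // content continues from the previous page
  truncatedAtEnd: boolean;   // content continues on the next page
}

interface MergedElement {
  kind: "table" | "formula" | "figure";
  startPage: number;
  endPage: number;
  latex: string;
}

function mergeAcrossPages(fragments: PageFragment[]): MergedElement[] {
  // Work on a page-ordered copy so the caller's array is left untouched.
  const ordered = fragments.slice().sort((a, b) => a.page - b.page);
  const merged: MergedElement[] = [];
  // Per kind, the element that was flagged as cut off at the end of its page.
  const open: { [kind: string]: MergedElement | undefined } = {};

  for (const frag of ordered) {
    const candidate = open[frag.kind];
    // Join only when both sides agree: the previous piece was cut off,
    // this piece says it is a continuation, and the pages are adjacent.
    if (candidate && frag.truncatedAtStart && candidate.endPage === frag.page - 1) {
      candidate.latex += "\n" + frag.latex;
      candidate.endPage = frag.page;
      if (!frag.truncatedAtEnd) {
        open[frag.kind] = undefined; // the element is now complete
      }
      continue;
    }
    const element: MergedElement = {
      kind: frag.kind,
      startPage: frag.page,
      endPage: frag.page,
      latex: frag.latex,
    };
    merged.push(element);
    if (frag.truncatedAtEnd) {
      open[frag.kind] = element; // wait for a continuation on the next page
    }
  }
  return merged;
}
```

In this sketch, a table flagged as truncated at the end of page 3 and a continuation flagged at the start of page 4 become one element with startPage 3 and endPage 4. An unflagged formula on page 4 stays a single-page element.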

Output shape

root contains the heading tree. Optional top-level tables[], formulas[], and figures[] carry cross-page merged elements with startPage/endPage.
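As a rough illustration of consuming this shape, the sketch below types the artifact and walks it. Only root, tables, formulas, figures, startPage, and endPage are taken from the description above. The node fields title and children, and the helper names outline and multiPageElements, are assumptions made for the example.

```typescript
// Assumed node shape: title and children are hypothetical field names.
interface SectionNode {
  title: string;
  children: SectionNode[];
}

// Only startPage and endPage are named in the docs; other fields are omitted.
interface MergedElement {
  startPage: number;
  endPage: number;
}

interface SemanticStructure {
  root: SectionNode;
  tables?: MergedElement[];   // optional top-level arrays
  formulas?: MergedElement[];
  figures?: MergedElement[];
}

// Flatten the heading tree into an indented outline, one line per section.
function outline(node: SectionNode, indent: string = ""): string[] {
  const lines: string[] = [indent + node.title];
  for (const child of node.children) {
    for (const line of outline(child, indent + "  ")) {
      lines.push(line);
    }
  }
  return lines;
}

// Collect merged elements that actually cross a page boundary.
function multiPageElements(doc: SemanticStructure): MergedElement[] {
  const all = (doc.tables ?? []).concat(doc.formulas ?? [], doc.figures ?? []);
  return all.filter((el) => el.endPage > el.startPage);
}
```

Because the three arrays are optional, the sketch treats a missing array as empty instead of assuming it is always present.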

LLM requirement

Semantic extraction requires an explicitly configured local provider and model.

detector = agent-structured-v1

Downstream rule

Read detector and strategy metadata before assuming semantic richness or cache reuse.
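A downstream consumer might apply this rule along the following lines. This is a sketch under stated assumptions: the metadata is assumed to expose detector and strategyKey as plain strings. The helpers hasRichSemantics and canReuseCached are hypothetical, and the caller is assumed to already hold the strategy key for its current configuration. How echo-pdf derives that key is not shown here.

```typescript
// Assumed metadata shape; only the field names detector and strategyKey
// come from the docs.
interface SemanticMetadata {
  detector: string;
  strategyKey: string;
}

// Detector value documented for the LLM-backed extraction path.
const AGENT_DETECTOR = "agent-structured-v1";

// Check the detector before relying on merged tables, formulas, or figures.
function hasRichSemantics(meta: SemanticMetadata): boolean {
  return meta.detector === AGENT_DETECTOR;
}

// Reuse a cached artifact only if it came from the expected detector and
// its strategy key matches the caller's current configuration.
function canReuseCached(
  cached: SemanticMetadata | undefined,
  currentStrategyKey: string,
): boolean {
  if (!cached) {
    return false; // nothing cached yet
  }
  return hasRichSemantics(cached) && cached.strategyKey === currentStrategyKey;
}
```

The sketch fails closed: a missing artifact, an unexpected detector, or a changed strategy key all lead to re-extraction, so a stale artifact is not silently reused.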

Not domain logic

The semantic layer is general document structure. Domain-specific interpretation belongs downstream.

Field                   Why it exists
detector                identifies which semantic extraction path produced the artifact
strategyKey             changes when provider, model, or extraction budget changes enough to invalidate reuse
pageIndexArtifactPath   links the semantic layer back to the stable page index
pageArtifactPath        lets section nodes point back to the originating page artifact

Still not domain logic.

The semantic layer should not encode datasheet-specific, EDA-specific, or other downstream product semantics.