A document_structured_extraction record is the persisted output of one LLM structured-extraction pass over a parsed document. It is separate from document_extractions, which stores raw OCR/parser text; this table stores the typed, schema-validated JSON produced after classification and field mapping. The cache key is a composite of document_pk, extraction_family, schema_name, schema_version, prompt_version, model_policy, source_checksum, and evidence_checksum. Because the checksums and the schema and prompt versions are part of the key, retries and backfills reuse existing output only when none of those inputs has changed, so the stale-source and stale-schema guards are never bypassed. Records are scoped to a workspace and hang off a parent document. Soft deletion invalidates superseded cache entries while preserving the audit trail.
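
To make the cache-key behaviour concrete, here is a minimal sketch of how the eight components could be compared. The CacheKey type and the cacheKeyString helper are hypothetical names introduced for illustration, document_pk is assumed to be numeric, and the real pipeline may compare the columns directly through the database index instead of building a string.

```typescript
// Hypothetical shape: the eight components named in the cache-key description.
// document_pk is assumed to be numeric; the source does not state its type.
interface CacheKey {
  document_pk: number;
  extraction_family: string;
  schema_name: string;
  schema_version: string;
  prompt_version: string;
  model_policy: string;
  source_checksum: string;
  evidence_checksum: string;
}

// Serialise the components in a fixed order so that equal keys yield equal strings.
function cacheKeyString(key: CacheKey): string {
  return JSON.stringify([
    key.document_pk,
    key.extraction_family,
    key.schema_name,
    key.schema_version,
    key.prompt_version,
    key.model_policy,
    key.source_checksum,
    key.evidence_checksum,
  ]);
}

const original: CacheKey = {
  document_pk: 1001, // made-up value for the example
  extraction_family: "invoice",
  schema_name: "invoice_v2_fr",
  schema_version: "2.4.0",
  prompt_version: "p1.3",
  model_policy: "gpt-5_structured_v1",
  source_checksum: "sha256:a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4",
  evidence_checksum: "sha256:f6e5d4c3b2a1f6e5d4c3b2a1f6e5d4c3",
};

// A retry carries identical inputs, so it maps to the same key.
const retry: CacheKey = { ...original };
// A schema bump or a re-parsed source changes one component and therefore the key.
const afterSchemaBump: CacheKey = { ...original, schema_version: "2.5.0" };
const afterReparse: CacheKey = {
  ...original,
  source_checksum: "sha256:00000000000000000000000000000000",
};

const reusedOnRetry = cacheKeyString(original) === cacheKeyString(retry);
const reusedAfterBump = cacheKeyString(original) === cacheKeyString(afterSchemaBump);
const reusedAfterReparse = cacheKeyString(original) === cacheKeyString(afterReparse);

console.log(reusedOnRetry); // true
console.log(reusedAfterBump); // false
console.log(reusedAfterReparse); // false
```

Serialising the components as a JSON array rather than joining them with a separator avoids accidental collisions when a component itself contains the separator character.
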

| Naming | Value |
| --- | --- |
| Object | DocumentStructuredExtraction |
| Resource type (JSON:API type) | document_structured_extraction |
| Collection / records root | (not a records root) |
| REST base | /v1/document-structured-extractions |
| Entity class | DocumentStructuredExtraction |

Internal object. Not currently exposed on the public REST API. The operations below describe the intended contract.

API operations

| Operation | Method & path | Status |
| --- | --- | --- |
| List | GET /v1/document-structured-extractions | 🟡 Planned |
| Retrieve | GET /v1/document-structured-extractions/{id} | 🟡 Planned |
| Create | POST /v1/document-structured-extractions | 🟡 Planned |
| Update | PATCH /v1/document-structured-extractions/{id} | 🟡 Planned |
| Delete | DELETE /v1/document-structured-extractions/{id} | 🟡 Planned |
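
Since these routes are planned and not yet live, the following is only a sketch of how a client might derive the method and path for each operation in the table. The requestLine function and its types are hypothetical; no host, authentication scheme, or request body is implied.

```typescript
type Operation = "list" | "retrieve" | "create" | "update" | "delete";

interface RequestLine {
  method: "GET" | "POST" | "PATCH" | "DELETE";
  path: string;
}

// REST base taken from the naming table.
const REST_BASE = "/v1/document-structured-extractions";

// Hypothetical helper: maps an operation (and an id where required) to method + path.
function requestLine(op: Operation, id?: string): RequestLine {
  // Collection-level operations take no id.
  if (op === "list") return { method: "GET", path: REST_BASE };
  if (op === "create") return { method: "POST", path: REST_BASE };

  // The remaining operations address a single record.
  if (id === undefined) {
    throw new Error(`${op} requires an id`);
  }
  const methods = { retrieve: "GET", update: "PATCH", delete: "DELETE" } as const;
  return { method: methods[op], path: `${REST_BASE}/${encodeURIComponent(id)}` };
}

const retrieve = requestLine("retrieve", "d3a1e8f2-5b4c-4e2f-9aab-0123456789ab");
console.log(`${retrieve.method} ${retrieve.path}`);
// GET /v1/document-structured-extractions/d3a1e8f2-5b4c-4e2f-9aab-0123456789ab
```
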

Data model

Attributes

| Field | Type | Required | Constraints | Allowed values | Description |
| --- | --- | --- | --- | --- | --- |
| document_structured_extraction_id | string (UUID) | ✅ Yes | unique | | Public UUID identifier, auto-generated by gen_random_uuid(). This is the stable external reference surfaced in the API. The internal pk is never exposed. |
| extraction_family | string | ✅ Yes | max length 32; included in partial-unique-on-deleted_at index | | High-level category of the extraction pass (e.g. ‘invoice’, ‘receipt’, ‘contract’). Part of the cache-key composite unique index. |
| schema_name | string | ✅ Yes | max length 96; included in partial-unique-on-deleted_at index | | Name of the Zod/JSON schema used to validate and shape the extraction output. Part of the cache-key composite unique index. |
| schema_version | string | ✅ Yes | max length 48; included in partial-unique-on-deleted_at index | | Semver string of the schema. A bump forces a new extraction even if all other cache-key components are unchanged. |
| prompt_version | string | ✅ Yes | max length 48; included in partial-unique-on-deleted_at index | | Version identifier of the LLM prompt template used. Part of the cache-key composite unique index. |
| model_policy | string | ✅ Yes | max length 64; included in partial-unique-on-deleted_at index | | Identifies the model-selection policy in effect when the extraction ran (e.g. a model alias or routing policy name). Part of the cache-key composite unique index. |
| source_checksum | string | ✅ Yes | max length 64; included in partial-unique-on-deleted_at index | | Checksum of the raw source content (OCR text / parsed PDF bytes) fed into the extraction. Stale-source guard: if the document’s parsed text changes, this checksum changes and a new extraction is required. |
| evidence_checksum | string | ✅ Yes | max length 64; included in partial-unique-on-deleted_at index | | Checksum of the structured evidence slice passed to the LLM (may differ from source_checksum when evidence is preprocessed or truncated). Part of the cache-key composite unique index. |
| classification_json | jsonb | ✅ Yes | NOT NULL | | LLM output from the classification phase: document type, confidence scores, locale, and any routing signals that determined which schema to apply in subsequent passes. |
| core_extraction_json | jsonb | ✅ Yes | NOT NULL | | LLM output from the core extraction phase: the mandatory, high-confidence fields (header-level invoice fields such as issuer, reference, date, totals). Always present even when detail extraction is skipped. |
| detail_extraction_json | jsonb | ⚪ No | nullable | | LLM output from the optional detail extraction phase: line items, tax breakdowns, and other structured sub-arrays that require a secondary prompt. NULL when detail extraction was not requested or failed gracefully. |
| final_extraction_json | jsonb | ✅ Yes | NOT NULL | | Merged, post-processed extraction output combining classification, core, and detail phases. This is the authoritative input used by downstream invoice mapping and persistence services. |
| invoice_mapped_json | jsonb | ⚪ No | nullable | | The result of mapping final_extraction_json onto Well’s internal invoice schema (entity PKs resolved, field names normalised). NULL when the document is not an invoice or mapping has not yet run. |
| selected_provider | string | ⚪ No | max length 32; nullable | | The AI provider selected at runtime (e.g. ‘openai’, ‘anthropic’). NULL when the model policy does not record per-extraction provider choice. |
| selected_model | string | ⚪ No | max length 128; nullable | | The specific model identifier resolved from the model_policy at runtime (e.g. ‘gpt-5.4’). NULL when not recorded. |
| quality_flags | jsonb | ⚪ No | nullable | | Post-extraction quality signals: low-confidence fields, OCR warnings, schema-validation failures, and any flags the extraction pipeline chose to surface for downstream review. NULL when no quality issues were detected. |
| created_at | 🔒 system · timestamp with time zone | ✅ Yes | NOT NULL; defaultRaw: NOW() | | Row creation timestamp, set once by the onCreate hook. Reflects when the extraction pipeline persisted this result. |
| updated_at | 🔒 system · timestamp with time zone | ⚪ No | nullable | | Row update timestamp, managed by the onUpdate hook. NULL until the first update after creation. |
| deleted_at | 🔒 system · timestamp with time zone | ⚪ No | nullable; excluded from uniq_document_structured_extraction_active when non-NULL | | Soft-delete timestamp. When set, the row is excluded from the composite partial-unique index (uniq_document_structured_extraction_active), allowing a fresh extraction with the same cache key to be inserted. |
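
The length limits on the string cache-key columns can be checked before a row is written. The sketch below is illustrative only: MAX_LENGTHS copies the limits from the table, while tooLongFields and its types are hypothetical helpers, not part of the documented pipeline. It measures JavaScript string length (UTF-16 code units), which matches the database's character count only for text without characters outside the Basic Multilingual Plane.

```typescript
// Max lengths copied from the Constraints column of the attributes table.
const MAX_LENGTHS = {
  extraction_family: 32,
  schema_name: 96,
  schema_version: 48,
  prompt_version: 48,
  model_policy: 64,
  source_checksum: 64,
  evidence_checksum: 64,
} as const;

type KeyField = keyof typeof MAX_LENGTHS;
type KeyFields = Record<KeyField, string>;

// Hypothetical helper: returns the names of the fields that exceed their limit.
function tooLongFields(fields: KeyFields): KeyField[] {
  return (Object.keys(MAX_LENGTHS) as KeyField[]).filter(
    (field) => fields[field].length > MAX_LENGTHS[field],
  );
}

// Values taken from the example record in this page.
const fromExample: KeyFields = {
  extraction_family: "invoice",
  schema_name: "invoice_v2_fr",
  schema_version: "2.4.0",
  prompt_version: "p1.3",
  model_policy: "gpt-5_structured_v1",
  source_checksum: "sha256:a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4",
  evidence_checksum: "sha256:f6e5d4c3b2a1f6e5d4c3b2a1f6e5d4c3",
};

// Same record with a 65-character policy name, one over the limit of 64.
const oversized: KeyFields = { ...fromExample, model_policy: new Array(66).join("x") };

console.log(tooLongFields(fromExample)); // []
console.log(tooLongFields(oversized)); // [ 'model_policy' ]
```
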

Relationships

| Name | Type | Required | Description |
| --- | --- | --- | --- |
| document | to-one (ManyToOne) | ✅ Yes | The parent Document this extraction was produced from. The document_pk FK carries ON DELETE CASCADE, so deleting a document hard-deletes all its structured extractions. |
| workspace | to-one (ManyToOne) | ✅ Yes | The tenant workspace that owns this extraction record. Used for all workspace-scoped queries and Hasura RLS enforcement. workspace_pk FK carries ON DELETE CASCADE. |

System-computed

  • document_structured_extraction_id: auto-generated by the gen_random_uuid() database default; never supplied by the caller
  • created_at: set to NOW() on INSERT via the MikroORM onCreate hook; never updated
  • updated_at: set to NOW() on UPDATE via the MikroORM onUpdate hook; NULL on creation
  • deleted_at: set by the extraction pipeline's soft-delete path; once set, the partial-unique index uniq_document_structured_extraction_active no longer covers the row, so a replacement extraction with the same cache key can be inserted
  • Cache-key deduplication: the partial unique index on (document_pk, extraction_family, schema_name, schema_version, prompt_version, model_policy, source_checksum, evidence_checksum) WHERE deleted_at IS NULL guarantees at most one live structured extraction per distinct extraction context; on invalidation, the pipeline soft-deletes the old row before inserting a fresh one
  • Write path: rows are written exclusively by the LLM extraction pipeline (ExtractPersistenceService / document structured-extraction service); no user-facing PATCH route exists for this entity
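
The soft-delete-then-insert sequence described above can be simulated in memory. This is a sketch under stated assumptions, not the pipeline's code: ExtractionStore, its methods, and the single-string cacheKey are all hypothetical stand-ins for the table and its partial unique index.

```typescript
interface Row {
  cacheKey: string; // hypothetical: the eight cache-key columns serialised into one string
  final_extraction_json: unknown;
  deleted_at: Date | null;
}

class ExtractionStore {
  private rows: Row[] = [];

  // Returns the live (not soft-deleted) row for a cache key, if any.
  findLive(cacheKey: string): Row | undefined {
    const live = this.rows.filter(
      (row) => row.cacheKey === cacheKey && row.deleted_at === null,
    );
    return live.length > 0 ? live[0] : undefined;
  }

  // Mimics the partial unique index: only rows with deleted_at === null conflict.
  insert(cacheKey: string, finalExtraction: unknown): Row {
    if (this.findLive(cacheKey) !== undefined) {
      throw new Error("unique violation: a live row already exists for this cache key");
    }
    const row: Row = { cacheKey, final_extraction_json: finalExtraction, deleted_at: null };
    this.rows.push(row);
    return row;
  }

  // Invalidation: soft-delete the old row first, then insert the replacement.
  replace(cacheKey: string, finalExtraction: unknown, now: Date): Row {
    const old = this.findLive(cacheKey);
    if (old !== undefined) {
      old.deleted_at = now;
    }
    return this.insert(cacheKey, finalExtraction);
  }

  // Total rows, including soft-deleted ones kept for the audit trail.
  totalRows(): number {
    return this.rows.length;
  }
}

const store = new ExtractionStore();
store.insert("key-1", { grand_total: 1200 });

// A second live row with the same cache key is rejected.
let duplicateRejected = false;
try {
  store.insert("key-1", { grand_total: 1200 });
} catch (err) {
  duplicateRejected = true;
}

// After invalidation, the replacement is accepted and the old row is retained.
store.replace("key-1", { grand_total: 1250 }, new Date());

console.log(duplicateRejected); // true
console.log(store.totalRows()); // 2
```

In the database the two steps would normally run inside one transaction so that a concurrent reader never observes a cache key with no live row; that detail is an assumption and is not stated in the source.
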

Example

{
  "data": {
    "type": "document_structured_extraction",
    "id": "d3a1e8f2-5b4c-4e2f-9aab-0123456789ab",
    "attributes": {
      "extraction_family": "invoice",
      "schema_name": "invoice_v2_fr",
      "schema_version": "2.4.0",
      "prompt_version": "p1.3",
      "model_policy": "gpt-5_structured_v1",
      "source_checksum": "sha256:a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4",
      "evidence_checksum": "sha256:f6e5d4c3b2a1f6e5d4c3b2a1f6e5d4c3",
      "classification_json": {
        "document_type": "invoice",
        "confidence": 0.98,
        "locale": "fr-FR"
      },
      "core_extraction_json": {
        "issuer_name": "Acme SAS",
        "reference_number": "FAC-2026-0042",
        "issue_date": "2026-05-15",
        "grand_total": 1200.00,
        "currency": "EUR"
      },
      "detail_extraction_json": {
        "line_items": [
          { "description": "Consulting services", "quantity": 1, "unit_price": 1000.00 },
          { "description": "Expenses", "quantity": 1, "unit_price": 200.00 }
        ]
      },
      "final_extraction_json": {
        "issuer_name": "Acme SAS",
        "reference_number": "FAC-2026-0042",
        "issue_date": "2026-05-15",
        "grand_total": 1200.00,
        "currency": "EUR",
        "line_items": [
          { "description": "Consulting services", "quantity": 1, "unit_price": 1000.00 },
          { "description": "Expenses", "quantity": 1, "unit_price": 200.00 }
        ]
      },
      "invoice_mapped_json": {
        "issuer_pk": 482917,
        "receiver_pk": 482918,
        "grand_total": 1200.00,
        "local_currency": "EUR",
        "status": "unpaid"
      },
      "selected_provider": "openai",
      "selected_model": "gpt-5.4",
      "quality_flags": {
        "low_confidence_fields": ["due_date"],
        "ocr_warnings": []
      },
      "created_at": "2026-05-28T14:32:10.000Z",
      "updated_at": "2026-05-28T14:32:11.000Z",
      "deleted_at": null
    },
    "relationships": {
      "document": {
        "data": { "type": "document", "id": "c7f2a1e0-0001-0001-0001-000000000001" }
      },
      "workspace": {
        "data": { "type": "workspace", "id": "9f3b2d00-aaaa-bbbb-cccc-000000000001" }
      }
    }
  }
}
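
As an illustration of how a consumer might sanity-check a payload like the one above, the sketch below compares the sum of the line items with grand_total. The types and the totalsAgree function are hypothetical, and the check assumes grand_total equals the sum of quantity × unit_price, which holds for this example but would not hold for invoices whose total includes tax or discounts.

```typescript
interface LineItem {
  description: string;
  quantity: number;
  unit_price: number;
}

// Only the fields used by this check; the real payload carries more.
interface FinalExtraction {
  grand_total: number;
  line_items?: LineItem[];
}

// Sum of quantity * unit_price over all line items.
function lineItemsTotal(items: LineItem[]): number {
  return items.reduce((sum, item) => sum + item.quantity * item.unit_price, 0);
}

// Hypothetical check: true when the totals match within a small tolerance,
// or when there are no line items to compare against.
function totalsAgree(extraction: FinalExtraction, tolerance: number = 0.005): boolean {
  if (extraction.line_items === undefined) return true;
  return Math.abs(lineItemsTotal(extraction.line_items) - extraction.grand_total) <= tolerance;
}

// Values copied from final_extraction_json in the example.
const finalExtraction: FinalExtraction = {
  grand_total: 1200.0,
  line_items: [
    { description: "Consulting services", quantity: 1, unit_price: 1000.0 },
    { description: "Expenses", quantity: 1, unit_price: 200.0 },
  ],
};

console.log(lineItemsTotal(finalExtraction.line_items ?? [])); // 1200
console.log(totalsAgree(finalExtraction)); // true
```

A mismatch found by a check like this is the kind of signal that could be surfaced through quality_flags, although the source does not say which checks the pipeline actually runs.
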
Source: apps/api/src/database/entities/DocumentStructuredExtraction.ts · domain: ingestion · tier: Infrastructure