DocumentExtraction stores the parsed-text output of one parser run against one Document. Each row represents a single (document, parser_name, parser_version) execution result and acts as a caching layer: before calling the remote parser (LlamaParse, LiteParse), the extraction orchestrator checks for an existing active row whose parser_name, parser_version, and source_checksum match the current request and, if one exists, reuses the stored text. One document may accumulate multiple rows when different parsers or parser versions are used. At most one row per (document_pk, parser_name, parser_version, COALESCE(source_checksum, '')) is active at any time, enforced by a partial unique index restricted to rows where deleted_at IS NULL. The entity is workspace-scoped for tenant isolation and cleanup queries.
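The cache lookup described above can be sketched as follows. This is a minimal in-memory illustration, not the orchestrator's actual code: the `ExtractionRow` and `CacheKey` shapes, the `findActive` and `getOrExtract` functions, and the synchronous `runParser` callback are all hypothetical names introduced here, and matching a null checksum against a null checksum is an assumption borrowed from the index's COALESCE rule.

```typescript
// Hypothetical in-memory stand-in for the document_extraction table.
interface ExtractionRow {
  documentPk: number;
  parserName: string;
  parserVersion: string;
  sourceChecksum: string | null;
  text: string;
  deletedAt: Date | null;
}

// The four values that identify a reusable extraction.
interface CacheKey {
  documentPk: number;
  parserName: string;
  parserVersion: string;
  sourceChecksum: string | null;
}

// Returns the active (not soft-deleted) row for the key, if any.
function findActive(rows: ExtractionRow[], key: CacheKey): ExtractionRow | undefined {
  return rows.find(
    (r) =>
      r.deletedAt === null &&
      r.documentPk === key.documentPk &&
      r.parserName === key.parserName &&
      r.parserVersion === key.parserVersion &&
      // Assumption: null checksums compare equal, as in the COALESCE'd index.
      (r.sourceChecksum ?? "") === (key.sourceChecksum ?? ""),
  );
}

// Reuses the stored text on a hit; runs the parser and stores a row on a miss.
// `runParser` stands in for the remote call and is synchronous only to keep the sketch short.
function getOrExtract(rows: ExtractionRow[], key: CacheKey, runParser: () => string): string {
  const hit = findActive(rows, key);
  if (hit) return hit.text;
  const text = runParser();
  rows.push({ ...key, text, deletedAt: null });
  return text;
}

// Usage: the second call is a cache hit; a new parser version is a miss.
const extractionTable: ExtractionRow[] = [];
let parserCalls = 0;
const cacheKey: CacheKey = {
  documentPk: 1,
  parserName: "llamaparse",
  parserVersion: "2.3.0",
  sourceChecksum: "sha256:aabbcc1122334455",
};
const fakeParser = (): string => {
  parserCalls += 1;
  return "Invoice\nDate: 2026-04-15";
};

const firstText = getOrExtract(extractionTable, cacheKey, fakeParser);
const secondText = getOrExtract(extractionTable, cacheKey, fakeParser);
const newVersionText = getOrExtract(extractionTable, { ...cacheKey, parserVersion: "2.4.0" }, fakeParser);
```

In this sketch the parser runs twice across the three calls: once for the original key and once for the bumped parser version, which is the behaviour the cache key is meant to produce.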
| Naming | Value |
| --- | --- |
| Object | DocumentExtraction |
| Resource type (JSON:API type) | document_extraction |
| Collection / records root | (not a records root) |
| REST base | /v1/document-extractions |
| Entity class | DocumentExtraction |

Internal object. Not currently exposed on the public REST API. The operations below describe the intended contract.

API operations

| Operation | Method & path | Status |
| --- | --- | --- |
| List | GET /v1/document-extractions | 🟡 Planned |
| Retrieve | GET /v1/document-extractions/{id} | 🟡 Planned |
| Create | POST /v1/document-extractions | 🟡 Planned |
| Update | PATCH /v1/document-extractions/{id} | 🟡 Planned |
| Delete | DELETE /v1/document-extractions/{id} | 🟡 Planned |

Data model

Attributes

| Field | Type | Required | Constraints | Allowed values | Description |
| --- | --- | --- | --- | --- | --- |
| document_extraction_id | string (UUID) | ✅ Yes | UNIQUE; NOT NULL; generated by gen_random_uuid() | Any valid UUID v4 | Public UUID identifying this extraction record. Generated automatically by the database default gen_random_uuid(). |
| parser_name | string | ✅ Yes | VARCHAR(64); NOT NULL | Any string up to 64 characters | Identifies the parser engine used for this extraction run (e.g. 'llamaparse', 'litepdf'). Part of the dedup key. |
| parser_version | string | ✅ Yes | VARCHAR(32); NOT NULL | Any string up to 32 characters | Semantic version of the parser at extraction time. Combined with parser_name and source_checksum to determine cache validity. |
| text | string (text) | ✅ Yes | TEXT; NOT NULL | Any text | Plain-text output produced by the parser. This is the primary extraction artifact consumed by downstream AI extraction services. |
| markdown | string (text) | ⚪ No | TEXT; nullable | Any text or null | Optional Markdown-formatted representation of the extracted content, provided when the parser supports structured markdown output. |
| page_count | integer | ⚪ No | INT; nullable | Any positive integer or null | Number of pages processed by the parser, when reported by the parser response. |
| used_ocr | boolean | ⚪ No | BOOLEAN; nullable | true / false / null | Whether the parser applied OCR during this extraction run. Null when the parser did not report OCR usage. |
| llamaparse_job_id | string | ⚪ No | VARCHAR(128); nullable | Any string up to 128 characters or null | External job identifier returned by the LlamaParse API for this parsing job. Useful for troubleshooting or re-fetching results from the provider. |
| source_checksum | string | ⚪ No | VARCHAR(64); nullable; participates in partial unique index uniq_document_extraction_per_parser_checksum (document_pk, parser_name, parser_version, COALESCE(source_checksum, '')) WHERE deleted_at IS NULL | Any string up to 64 characters or null | Content-addressable checksum of the source document binary at the time of extraction. Together with parser_name and parser_version, forms the cache key for dedup. A NULL checksum is treated as an empty string in the partial unique index. |
| created_at | 🔒 system, datetime | ✅ Yes | TIMESTAMPTZ; NOT NULL; default NOW() | | Timestamp when this extraction record was created. Set once at insert via onCreate hook; never updated. |
| updated_at | 🔒 system, datetime | ⚪ No | TIMESTAMPTZ; nullable | | Timestamp of the last in-place update to this record. Set by onUpdate hook; null until the first update after initial insert. |
| deleted_at | 🔒 system, datetime | ⚪ No | TIMESTAMPTZ; nullable | | Soft-delete timestamp. When non-null the record is logically deleted. The partial unique index on the cache key is scoped to WHERE deleted_at IS NULL, allowing a new extraction to replace a soft-deleted one. |
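
The length and range constraints in the table can be checked before a write. The sketch below is illustrative only: `ExtractionAttributes`, `MAX_LENGTHS`, and `validateAttributes` are hypothetical names, not part of the entity, and JavaScript string length counts UTF-16 code units, which only approximates the VARCHAR character count.

```typescript
// Hypothetical input shape mirroring the non-system attributes listed above.
interface ExtractionAttributes {
  parser_name: string;
  parser_version: string;
  text: string;
  markdown?: string | null;
  page_count?: number | null;
  used_ocr?: boolean | null;
  llamaparse_job_id?: string | null;
  source_checksum?: string | null;
}

// Maximum lengths taken from the VARCHAR constraints in the table.
const MAX_LENGTHS: Array<[keyof ExtractionAttributes, number]> = [
  ["parser_name", 64],
  ["parser_version", 32],
  ["llamaparse_job_id", 128],
  ["source_checksum", 64],
];

// Returns a list of human-readable problems; an empty list means the input passes.
function validateAttributes(attrs: ExtractionAttributes): string[] {
  const errors: string[] = [];
  for (const [field, max] of MAX_LENGTHS) {
    const value = attrs[field];
    // Null and undefined are skipped: nullable columns accept them.
    if (typeof value === "string" && value.length > max) {
      errors.push(`${field} exceeds ${max} characters`);
    }
  }
  const pages = attrs.page_count;
  if (pages !== null && pages !== undefined && (!Number.isInteger(pages) || pages <= 0)) {
    errors.push("page_count must be a positive integer or null");
  }
  return errors;
}

// Usage: an over-long checksum and a zero page count are both reported.
const validationErrors = validateAttributes({
  parser_name: "llamaparse",
  parser_version: "2.3.0",
  text: "Invoice",
  page_count: 0,
  source_checksum: "x".repeat(65),
});
```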

Relationships

| Name | Type | Required | Description |
| --- | --- | --- | --- |
| document | to-one (ManyToOne) | Yes (FK NOT NULL, ON DELETE CASCADE) | The Document this extraction was produced from. Foreign key document_pk references core_api.documents(pk). Cascade-deletes this row if the parent Document is hard-deleted. |
| workspace | to-one (ManyToOne) | Yes (FK NOT NULL, ON DELETE CASCADE) | The Workspace that owns this extraction, used for tenant isolation and cleanup queries. Foreign key workspace_pk references core_api.workspaces(pk). Cascade-deletes this row if the Workspace is hard-deleted. |

System-computed

  • document_extraction_id: generated by the gen_random_uuid() database default at INSERT; never supplied by the caller
  • created_at: set once via the MikroORM onCreate hook (new Date()); immutable after creation
  • updated_at: set by the MikroORM onUpdate hook (new Date()) on every subsequent flush; null until the first update
  • deleted_at: written by the extraction orchestrator or backfill services to soft-delete superseded rows; the partial unique index is scoped to deleted_at IS NULL, so a new parser run can replace a deleted cache entry without a constraint violation
  • Cache-dedup logic: DocumentExtractionRepository (or the orchestrator) queries for an existing active row matching (document_pk, parser_name, parser_version, source_checksum) before calling the remote parser; a hit returns the stored text without an external API call
  • source_checksum coalesce sentinel: the partial unique index uses COALESCE(source_checksum, '') so that rows with a null checksum still collide with each other; without it, PostgreSQL would treat every null-checksum row as distinct, because NULLs are distinct in unique indexes by default
  • Workspace scope: workspace_pk is always set to req.workspace.pk at write time by the extraction pipeline; never accepted from client input
  • parser_name and parser_version: stamped at extraction time from the calling service constant and the deployed parser SDK version; not user-supplied
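
The interaction between the coalesce sentinel and soft deletion can be simulated in memory. This is a sketch of the index semantics only, under the assumption that the index behaves as described above; `IndexedRow`, `uniqueKey`, `insertRow`, and `softDeleteActive` are hypothetical helpers, and the real enforcement happens in PostgreSQL.

```typescript
interface IndexedRow {
  documentPk: number;
  parserName: string;
  parserVersion: string;
  sourceChecksum: string | null;
  deletedAt: Date | null;
}

// Mirrors the indexed expression:
// (document_pk, parser_name, parser_version, COALESCE(source_checksum, '')).
function uniqueKey(row: IndexedRow): string {
  return JSON.stringify([row.documentPk, row.parserName, row.parserVersion, row.sourceChecksum ?? ""]);
}

// Rejects an active insert only when another active row already holds the same key,
// which is what "partial index WHERE deleted_at IS NULL" means.
function insertRow(rows: IndexedRow[], row: IndexedRow): void {
  const conflict = rows.some((r) => r.deletedAt === null && uniqueKey(r) === uniqueKey(row));
  if (row.deletedAt === null && conflict) {
    throw new Error("unique violation for key " + uniqueKey(row));
  }
  rows.push(row);
}

// Soft-deletes every active row with the given key and returns how many were touched.
function softDeleteActive(rows: IndexedRow[], key: string, now: Date): number {
  let count = 0;
  for (const r of rows) {
    if (r.deletedAt === null && uniqueKey(r) === key) {
      r.deletedAt = now;
      count += 1;
    }
  }
  return count;
}

// Usage: two null-checksum rows collide; after a soft delete the insert succeeds.
const indexRows: IndexedRow[] = [];
const nullChecksumRow: IndexedRow = {
  documentPk: 7,
  parserName: "llamaparse",
  parserVersion: "2.3.0",
  sourceChecksum: null,
  deletedAt: null,
};

insertRow(indexRows, { ...nullChecksumRow });

let duplicateRejected = false;
try {
  insertRow(indexRows, { ...nullChecksumRow });
} catch {
  duplicateRejected = true;
}

softDeleteActive(indexRows, uniqueKey(nullChecksumRow), new Date());
insertRow(indexRows, { ...nullChecksumRow });
```

Without the `?? ""` step in `uniqueKey`, a faithful simulation of PostgreSQL's default NULL handling would let the duplicate insert through, which is the case the sentinel is there to prevent.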

Example

```json
{
  "data": {
    "type": "document_extraction",
    "id": "3f8e2c14-0a71-4b9e-a632-1c7de05f8b24",
    "attributes": {
      "document_extraction_id": "3f8e2c14-0a71-4b9e-a632-1c7de05f8b24",
      "parser_name": "llamaparse",
      "parser_version": "2.3.0",
      "text": "Invoice\nDate: 2026-04-15\nTotal: €1 250,00\n...",
      "markdown": "# Invoice\n**Date:** 2026-04-15  \n**Total:** €1 250,00\n\n...",
      "page_count": 2,
      "used_ocr": false,
      "llamaparse_job_id": "lp_job_a1b2c3d4e5f6",
      "source_checksum": "sha256:aabbcc1122334455",
      "created_at": "2026-05-28T09:14:32.000Z",
      "updated_at": "2026-05-28T09:14:55.000Z",
      "deleted_at": null
    },
    "relationships": {
      "document": {
        "data": { "type": "document", "id": "9c4a7b23-1f5d-4e88-b731-2d6ef19a0c47" }
      },
      "workspace": {
        "data": { "type": "workspace", "id": "1a2b3c4d-5e6f-7a8b-9c0d-e1f2a3b4c5d6" }
      }
    }
  }
}
```
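
A consumer that receives a payload of this shape could validate it before use. The following is a minimal sketch, assuming the planned contract matches the example above; `DocumentExtractionResource`, `isRecord`, and `parseExtraction` are hypothetical names, and only the three required string attributes are checked.

```typescript
// Hypothetical typed view of the resource object in the example payload.
interface DocumentExtractionResource {
  type: "document_extraction";
  id: string;
  attributes: Record<string, unknown> & {
    parser_name: string;
    parser_version: string;
    text: string;
  };
}

// Narrowing helper: true for plain (non-array, non-null) objects.
function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Parses a JSON:API document and checks the fields this sketch relies on.
function parseExtraction(json: string): DocumentExtractionResource {
  const body: unknown = JSON.parse(json);
  const data = isRecord(body) ? body["data"] : undefined;
  if (!isRecord(data)) throw new Error("payload has no data object");
  if (data["type"] !== "document_extraction") throw new Error("unexpected resource type");

  const id = data["id"];
  const attributes = data["attributes"];
  if (typeof id !== "string" || !isRecord(attributes)) {
    throw new Error("missing id or attributes");
  }
  for (const field of ["parser_name", "parser_version", "text"]) {
    if (typeof attributes[field] !== "string") {
      throw new Error(`attribute ${field} must be a string`);
    }
  }
  return {
    type: "document_extraction",
    id,
    attributes: attributes as DocumentExtractionResource["attributes"],
  };
}

// Usage with a trimmed version of the example payload.
const samplePayload = JSON.stringify({
  data: {
    type: "document_extraction",
    id: "3f8e2c14-0a71-4b9e-a632-1c7de05f8b24",
    attributes: { parser_name: "llamaparse", parser_version: "2.3.0", text: "Invoice", deleted_at: null },
  },
});
const parsedExtraction = parseExtraction(samplePayload);
```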
Source: /Users/maximechampoux/platform/apps/api/src/database/entities/DocumentExtraction.ts · domain: ingestion · tier: Infrastructure