DocumentExtraction

DocumentExtraction stores the parsed-text output of one parser run against one Document. Each row represents a single (document, parser_name, parser_version) execution result and acts as a cache entry: before calling the remote parser (LlamaParse, LiteParse), the extraction orchestrator checks for an existing active row whose parser_name, parser_version, and source_checksum match the current request and, on a hit, reuses the stored text. One document may accumulate multiple rows when different parsers or parser versions are used. At most one row per (document_pk, parser_name, parser_version, COALESCE(source_checksum, '')) is active at any time, enforced by a partial unique index scoped to deleted_at IS NULL. The entity is workspace-scoped for tenant isolation and cleanup queries.
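The read path described above can be sketched as follows. This is a minimal in-memory illustration, not the orchestrator's actual code: the `ExtractionRow` and `ExtractionRequest` shapes, the numeric `documentPk`, the `callRemoteParser` callback, and the choice to treat a null checksum as an empty string during lookup (mirroring the index expression) are all assumptions made for the example.

```typescript
// Hypothetical in-memory row shape; names loosely mirror the columns in this page.
interface ExtractionRow {
  documentPk: number; // assumption: the real pk type is not stated in this page
  parserName: string;
  parserVersion: string;
  sourceChecksum: string | null;
  text: string;
  deletedAt: Date | null;
}

// Hypothetical request shape: the cache key for one extraction attempt.
interface ExtractionRequest {
  documentPk: number;
  parserName: string;
  parserVersion: string;
  sourceChecksum: string | null;
}

// Returns the active (not soft-deleted) row matching the request, or undefined on a miss.
function findCachedExtraction(
  rows: ExtractionRow[],
  req: ExtractionRequest,
): ExtractionRow | undefined {
  return rows.find(
    (row) =>
      row.deletedAt === null &&
      row.documentPk === req.documentPk &&
      row.parserName === req.parserName &&
      row.parserVersion === req.parserVersion &&
      // Assumption: null and '' are treated alike, as in the index's COALESCE.
      (row.sourceChecksum ?? "") === (req.sourceChecksum ?? ""),
  );
}

// Reuses stored text on a hit; calls the (hypothetical) remote parser only on a miss.
async function getExtractionText(
  rows: ExtractionRow[],
  req: ExtractionRequest,
  callRemoteParser: (req: ExtractionRequest) => Promise<string>,
): Promise<string> {
  const hit = findCachedExtraction(rows, req);
  if (hit) return hit.text;
  const text = await callRemoteParser(req);
  rows.push({ ...req, text, deletedAt: null });
  return text;
}
```

A soft-deleted row with an identical key is skipped by the lookup, so a superseded extraction never satisfies a cache check.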

Naming	Value
Object	DocumentExtraction
Resource type (JSON:API `type`)	`document_extraction`
Collection / records root	— (not a records root)
REST base	`/v1/document-extractions`
Entity class	`DocumentExtraction`

Internal object. Not currently exposed on the public REST API. The operations below describe the intended contract.

API operations

Operation	Method & path	Status
List	`GET /v1/document-extractions`	🟡 Planned
Retrieve	`GET /v1/document-extractions/{id}`	🟡 Planned
Create	`POST /v1/document-extractions`	🟡 Planned
Update	`PATCH /v1/document-extractions/{id}`	🟡 Planned
Delete	`DELETE /v1/document-extractions/{id}`	🟡 Planned

Data model

Attributes

Field	Type	Required	Constraints	Allowed values	Description
document_extraction_id	string (UUID)	✅ Yes	UNIQUE; NOT NULL; generated by gen_random_uuid()	Any valid UUID v4	Public UUID identifying this extraction record. Generated automatically by the database default gen_random_uuid().
parser_name	string	✅ Yes	VARCHAR(64); NOT NULL	Any string up to 64 characters	Identifies the parser engine used for this extraction run (e.g. ‘llamaparse’, ‘litepdf’). Part of the dedup key.
parser_version	string	✅ Yes	VARCHAR(32); NOT NULL	Any string up to 32 characters	Semantic version of the parser at extraction time. Combined with parser_name and source_checksum to determine cache validity.
text	string (text)	✅ Yes	TEXT; NOT NULL	Any text	Plain-text output produced by the parser. This is the primary extraction artifact consumed by downstream AI extraction services.
markdown	string (text)	⚪ No	TEXT; nullable	Any text or null	Optional Markdown-formatted representation of the extracted content, provided when the parser supports structured markdown output.
page_count	integer	⚪ No	INT; nullable	Any positive integer or null	Number of pages processed by the parser, when reported by the parser response.
used_ocr	boolean	⚪ No	BOOLEAN; nullable	true / false / null	Whether the parser applied OCR during this extraction run. Null when the parser did not report OCR usage.
llamaparse_job_id	string	⚪ No	VARCHAR(128); nullable	Any string up to 128 characters or null	External job identifier returned by the LlamaParse API for this parsing job. Useful for troubleshooting or re-fetching results from the provider.
source_checksum	string	⚪ No	VARCHAR(64); nullable; participates in partial unique index uniq_document_extraction_per_parser_checksum (document_pk, parser_name, parser_version, COALESCE(source_checksum, '')) WHERE deleted_at IS NULL	Any string up to 64 characters or null	Content-addressable checksum of the source document binary at the time of extraction. Together with parser_name and parser_version, forms the cache key for dedup. A NULL checksum is treated as an empty string in the partial unique index.
created_at	🔒 system — datetime	✅ Yes	TIMESTAMPTZ; NOT NULL; default NOW()	—	Timestamp when this extraction record was created. Set once at insert via onCreate hook; never updated.
updated_at	🔒 system — datetime	⚪ No	TIMESTAMPTZ; nullable	—	Timestamp of the last in-place update to this record. Set by onUpdate hook; null until the first update after initial insert.
deleted_at	🔒 system — datetime	⚪ No	TIMESTAMPTZ; nullable	—	Soft-delete timestamp. When non-null the record is logically deleted. The partial unique index on the cache key is scoped to WHERE deleted_at IS NULL, allowing a new extraction to replace a soft-deleted one.
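For readers who consume this shape from TypeScript, the attribute table can be read as the type below. This is an illustrative sketch only: the interface and the `validateExtractionFields` helper are hypothetical names, not part of the entity class, and timestamps are typed as ISO 8601 strings on the assumption that they are serialized as in the JSON example on this page.

```typescript
// Sketch of the attribute shape; nullable columns map to `T | null`.
interface DocumentExtractionAttributes {
  document_extraction_id: string; // UUID, database-generated
  parser_name: string; // VARCHAR(64)
  parser_version: string; // VARCHAR(32)
  text: string;
  markdown: string | null;
  page_count: number | null;
  used_ocr: boolean | null;
  llamaparse_job_id: string | null; // VARCHAR(128)
  source_checksum: string | null; // VARCHAR(64)
  created_at: string;
  updated_at: string | null;
  deleted_at: string | null;
}

type CheckedFields = Pick<
  DocumentExtractionAttributes,
  "parser_name" | "parser_version" | "page_count" | "llamaparse_job_id" | "source_checksum"
>;

// Hypothetical helper: returns the list of violated column constraints (empty when valid).
function validateExtractionFields(fields: CheckedFields): string[] {
  const errors: string[] = [];
  const checkMaxLength = (name: string, value: string | null, max: number): void => {
    // Array.from counts code points, which is closer to VARCHAR(n) character
    // counting than the UTF-16 `.length` property.
    if (value !== null && Array.from(value).length > max) {
      errors.push(name + " exceeds " + max + " characters");
    }
  };
  checkMaxLength("parser_name", fields.parser_name, 64);
  checkMaxLength("parser_version", fields.parser_version, 32);
  checkMaxLength("llamaparse_job_id", fields.llamaparse_job_id, 128);
  checkMaxLength("source_checksum", fields.source_checksum, 64);
  if (
    fields.page_count !== null &&
    (!Number.isInteger(fields.page_count) || fields.page_count <= 0)
  ) {
    errors.push("page_count must be a positive integer or null");
  }
  return errors;
}
```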

Relationships

Name	Type	Required	Description
document	to-one (ManyToOne)	Yes — FK NOT NULL, ON DELETE CASCADE	The Document this extraction was produced from. Foreign key document_pk references core_api.documents(pk). Cascade-deletes this row if the parent Document is hard-deleted.
workspace	to-one (ManyToOne)	Yes — FK NOT NULL, ON DELETE CASCADE	The Workspace that owns this extraction, used for tenant isolation and cleanup queries. Foreign key workspace_pk references core_api.workspaces(pk). Cascade-deletes this row if the Workspace is hard-deleted.

System-computed

document_extraction_id — generated by gen_random_uuid() database default at INSERT; never supplied by the caller
created_at — set once via MikroORM onCreate hook (new Date()); immutable after creation
updated_at — set by MikroORM onUpdate hook (new Date()) on every subsequent flush; null until the first update
deleted_at — written by the extraction orchestrator or backfill services to soft-delete superseded rows; the partial unique index is scoped to deleted_at IS NULL so a new parser run can replace a deleted cache entry without a constraint violation
Cache-dedup logic — DocumentExtractionRepository (or the orchestrator) queries for an existing active row matching (document_pk, parser_name, parser_version, source_checksum) before calling the remote parser; a hit returns the stored text without an external API call
source_checksum coalesce sentinel — the partial unique index uses COALESCE(source_checksum, '') so that active rows with a NULL checksum collide with one another; without it, duplicate active null-checksum rows would be allowed, because PostgreSQL treats NULLs as distinct in unique indexes by default
Workspace scope — workspace_pk is always set to req.workspace.pk at write time by the extraction pipeline; never accepted from client input
parser_name, parser_version — stamped at extraction time from the calling service constant and the deployed parser SDK version; not user-supplied
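The interaction between the partial unique index, the COALESCE sentinel, and soft deletion can be modeled in memory as shown below. This is a sketch of the semantics only, not of PostgreSQL or the repository code: `IndexedRow`, `uniqueKey`, and `ExtractionIndexModel` are hypothetical names, and `documentPk` is assumed to be numeric.

```typescript
// Hypothetical row shape holding only the columns the index cares about.
interface IndexedRow {
  documentPk: number;
  parserName: string;
  parserVersion: string;
  sourceChecksum: string | null;
  deletedAt: Date | null;
}

// Mirrors the index expression: COALESCE(source_checksum, '') maps NULL to ''.
function uniqueKey(row: IndexedRow): string {
  return JSON.stringify([
    row.documentPk,
    row.parserName,
    row.parserVersion,
    row.sourceChecksum ?? "",
  ]);
}

class ExtractionIndexModel {
  private rows: IndexedRow[] = [];

  // Only active rows (deletedAt === null) participate in the uniqueness check,
  // which models the WHERE deleted_at IS NULL predicate of the partial index.
  insert(row: IndexedRow): void {
    if (row.deletedAt === null) {
      const key = uniqueKey(row);
      const conflict = this.rows.some(
        (existing) => existing.deletedAt === null && uniqueKey(existing) === key,
      );
      if (conflict) throw new Error("unique violation: " + key);
    }
    this.rows.push({ ...row });
  }

  // Soft-deletes every active row sharing the key; returns how many were marked.
  softDelete(row: IndexedRow, at: Date): number {
    const key = uniqueKey(row);
    let count = 0;
    for (const existing of this.rows) {
      if (existing.deletedAt === null && uniqueKey(existing) === key) {
        existing.deletedAt = at;
        count++;
      }
    }
    return count;
  }
}
```

In this model, inserting a second active row with a null checksum for the same document and parser fails, while the same insert succeeds once the first row has been soft-deleted.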

Example

{
  "data": {
    "type": "document_extraction",
    "id": "3f8e2c14-0a71-4b9e-a632-1c7de05f8b24",
    "attributes": {
      "document_extraction_id": "3f8e2c14-0a71-4b9e-a632-1c7de05f8b24",
      "parser_name": "llamaparse",
      "parser_version": "2.3.0",
      "text": "Invoice\nDate: 2026-04-15\nTotal: €1 250,00\n...",
      "markdown": "# Invoice\n**Date:** 2026-04-15  \n**Total:** €1 250,00\n\n...",
      "page_count": 2,
      "used_ocr": false,
      "llamaparse_job_id": "lp_job_a1b2c3d4e5f6",
      "source_checksum": "sha256:aabbcc1122334455",
      "created_at": "2026-05-28T09:14:32.000Z",
      "updated_at": "2026-05-28T09:14:55.000Z",
      "deleted_at": null
    },
    "relationships": {
      "document": {
        "data": { "type": "document", "id": "9c4a7b23-1f5d-4e88-b731-2d6ef19a0c47" }
      },
      "workspace": {
        "data": { "type": "workspace", "id": "1a2b3c4d-5e6f-7a8b-9c0d-e1f2a3b4c5d6" }
      }
    }
  }
}

Source: /Users/maximechampoux/platform/apps/api/src/database/entities/DocumentExtraction.ts · domain: ingestion · tier: Infrastructure
