
# Document Ingestion

When a document is uploaded to a collection, Martha automatically parses it, splits it into searchable chunks, and generates vector embeddings. This process runs as a durable Temporal workflow on a dedicated task queue, isolated from the main API.

## How It Works

```
Upload → Validate → Parse+Chunk → Enrich (parallel) → Finalize
           │            │               │                  │
           │            │               ├─ Embed           └─ Mark "ready"
           │            │               ├─ VLM Describe
           │            │               └─ ColPali Index
           │            └─ Docling parse + HybridChunker + page classification
           └─ Size, type, and tenant checks
```

Each stage is a separate Temporal activity with its own retry policy and timeout. The enrich stage runs three activities in parallel — all are non-fatal, so failures degrade gracefully rather than blocking the pipeline.
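The fan-out/fan-in shape of the enrich stage can be sketched with plain `asyncio`. The real pipeline uses Temporal activities; the task names below are illustrative, not Martha's actual API:

```python
import asyncio

# Three enrichment sub-tasks run in parallel; a failure in any one of
# them is recorded rather than raised, so the pipeline can still finalize.

async def embed_chunks() -> str:
    return "embedded"

async def describe_drawings() -> str:
    raise RuntimeError("VLM unavailable")  # simulate a non-fatal failure

async def index_pages() -> str:
    return "indexed"

async def enrich() -> dict:
    results = await asyncio.gather(
        embed_chunks(), describe_drawings(), index_pages(),
        return_exceptions=True,  # collect failures instead of cancelling siblings
    )
    names = ["embed", "vlm_describe", "colpali_index"]
    # Failed sub-tasks become warnings; everything else proceeds normally.
    return {n: ("ok" if not isinstance(r, Exception) else f"skipped: {r}")
            for n, r in zip(names, results)}

status = asyncio.run(enrich())
```

With `return_exceptions=True`, one failing task does not cancel its siblings, which mirrors the non-fatal behavior described above.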

## Page Rendering and Visual Processing

For PDFs, every page is rendered to PNG and uploaded to storage (up to `MAX_RENDER_PAGES`, default 300). This enables:

- `get_page_image` and `visual_search` tools to return page images to agents
- Visual indexing via ColPali (ColiVara SDK) for image-based retrieval across all pages

Additionally, pages are classified as text, drawing, or table based on element bounding boxes. Drawing pages get extra treatment:

- VLM descriptions generated by Gemini 3 Flash, stored as searchable `drawing_description` chunks
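One plausible bounding-box heuristic for the page classification described above. The element types and the 0.5 coverage threshold are illustrative assumptions, not Martha's actual rules:

```python
# Classify a page from its parsed elements' bounding boxes.
# elements: [{"type": "text"|"picture"|"table", "bbox": (x0, y0, x1, y1)}]

def classify_page(elements: list[dict], page_area: float) -> str:
    def area(bbox):
        x0, y0, x1, y1 = bbox
        return max(0.0, x1 - x0) * max(0.0, y1 - y0)

    covered = {"text": 0.0, "picture": 0.0, "table": 0.0}
    for el in elements:
        covered[el["type"]] = covered.get(el["type"], 0.0) + area(el["bbox"])

    # A page dominated by picture area is treated as a drawing, a page
    # dominated by table area as a table, and everything else as text.
    if covered["picture"] > 0.5 * page_area:
        return "drawing"
    if covered["table"] > 0.5 * page_area:
        return "table"
    return "text"
```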

All enrichment activities (embedding, VLM, vision indexing) are optional and non-fatal — if any fails, text search still works.

## Supported Formats

| Format | Content Type | Notes |
|---|---|---|
| PDF | `application/pdf` | Full text extraction with optional OCR |
| DOCX | `application/vnd.openxmlformats-officedocument.wordprocessingml.document` | Word documents |
| PPTX | `application/vnd.openxmlformats-officedocument.presentationml.presentation` | PowerPoint presentations |
| HTML | `text/html` | Web pages |
| Markdown | `text/markdown` | Markdown files |
| CSV | `text/csv` | Tabular data |
| Images | `image/png`, `image/jpeg`, `image/tiff`, `image/webp`, `image/bmp` | OCR text extraction |

!!! info "Maximum file size"
    The default maximum is 50 MB per document. This can be adjusted via the `INGESTION_MAX_DOC_SIZE` environment variable.

## Graceful Degradation

The enrich stage (embedding, VLM descriptions, visual indexing) is not required for a document to be usable. If any enrichment fails:

- The document is still marked as "ready"
- Chunks are stored with full text, so keyword search works
- Semantic search is unavailable if embedding failed
- Drawing descriptions are unavailable if the VLM failed
- Visual retrieval is unavailable if ColPali indexing failed
- All of these can be backfilled by re-ingesting the document

## Monitoring Ingestion

### Admin UI

The Documents page shows ingestion status for each document:

- **Gray badge** — Pending (not yet started)
- **Blue badge with spinner** — Ingesting (workflow running)
- **Green badge** — Ready (ingestion complete)
- **Red badge** — Error (with details)

When any document is actively ingesting, the page auto-refreshes every 3 seconds. Click on a document to see detailed progress: current stage, percentage, chunk count, and any errors.

### API

Poll the ingestion status endpoint for programmatic monitoring:

```bash
GET /api/admin/documents/{document_id}/ingestion-status?tenant_id=your-tenant
```

Response:

```json
{
  "document_id": "a1b2c3d4-...",
  "ingestion_status": "ingesting",
  "stage": "embed",
  "progress_pct": 65,
  "chunk_count": 42,
  "revision_id": "e5f6g7h8-...",
  "error": null,
  "started_at": "2026-02-17T10:30:00Z",
  "completed_at": null
}
```

Stages progress through: `validate` → `create_revision` → `parse_and_chunk` → `enrich` → `finalize` → `done`.
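A polling loop for this endpoint might look like the following sketch. The base URL is a placeholder, and the terminal statuses (`ready`, `error`) are inferred from the status badges described above:

```python
import json
import time
import urllib.request

BASE = "http://localhost:8000"  # placeholder; adapt to your deployment

def is_terminal(status: str) -> bool:
    # "ready" and "error" end the workflow; "pending"/"ingesting" do not.
    return status in ("ready", "error")

def wait_for_ingestion(document_id: str, tenant_id: str, interval: float = 3.0) -> dict:
    """Poll the ingestion-status endpoint until the document settles."""
    url = (f"{BASE}/api/admin/documents/{document_id}"
           f"/ingestion-status?tenant_id={tenant_id}")
    while True:
        with urllib.request.urlopen(url) as resp:
            body = json.load(resp)
        if is_terminal(body["ingestion_status"]):
            return body
        time.sleep(interval)
```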


## Re-Ingestion

To re-process a document (e.g., after changing chunking settings or to backfill embeddings):

```bash
POST /api/admin/documents/{document_id}/reingest?tenant_id=your-tenant
```

This creates a new revision — the previous revision and its chunks are preserved. The `read_doc` and `search_docs` tools always use the latest successful revision.

!!! warning "Concurrent re-ingestion"
    A document that is already being ingested cannot be re-ingested simultaneously. The API returns `409 Conflict` until the current workflow completes.


## Revisions

Each ingestion run creates an immutable revision:

- Stores the raw parsed text and structured metadata (pages, sections)
- Preserves a content hash for deduplication
- Old revisions and their chunks remain in the database
- `read_doc` always serves from the latest successful revision (no S3 download needed)
- The revision number auto-increments per document

## Backpressure

To prevent a single tenant from overwhelming the ingestion pipeline, Martha enforces a per-tenant concurrency limit (default: 5 concurrent workflows).

When a tenant hits the limit, upload and re-ingest requests return `429 Too Many Requests`. The quota is released automatically when a workflow completes (success or failure) or after a TTL safety valve expires (default: 1 hour).
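The quota semantics can be modeled in-process. The class below, with hypothetical method names, only illustrates the limit-plus-TTL behavior; it is not Martha's production implementation:

```python
import time

class IngestionQuota:
    """Per-tenant concurrency limit with a TTL safety valve."""

    def __init__(self, limit: int = 5, ttl: float = 3600.0):
        self.limit, self.ttl = limit, ttl
        self._slots: dict[str, list[float]] = {}  # tenant -> acquisition times

    def try_acquire(self, tenant: str) -> bool:
        now = time.monotonic()
        # Slots older than the TTL are treated as leaked and reclaimed.
        live = [t for t in self._slots.get(tenant, []) if now - t < self.ttl]
        if len(live) >= self.limit:
            self._slots[tenant] = live
            return False  # caller should respond 429 Too Many Requests
        self._slots[tenant] = live + [now]
        return True

    def release(self, tenant: str) -> None:
        # Called on workflow completion, success or failure alike.
        slots = self._slots.get(tenant, [])
        if slots:
            slots.pop(0)

q = IngestionQuota(limit=2)
```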


## Configuration

All ingestion settings can be overridden via environment variables:

| Variable | Default | Description |
|---|---|---|
| `INGESTION_CHUNK_SIZE` | `500` | Target tokens per chunk |
| `INGESTION_CHUNK_OVERLAP` | `50` | Token overlap between chunks (~10%) |
| `INGESTION_TOKENIZER` | `cl100k_base` | Tokenizer model for chunk sizing |
| `INGESTION_EMBEDDING_MODEL` | `text-embedding-3-small` | Embedding model (any LiteLLM-compatible) |
| `INGESTION_EMBEDDING_DIMS` | `1536` | Expected embedding dimensions |
| `INGESTION_EMBEDDING_BATCH` | `100` | Chunks per embedding API call |
| `INGESTION_EMBEDDING_RETRIES` | `3` | Max retries for embedding failures |
| `INGESTION_MAX_CONCURRENT` | `5` | Max concurrent workflows per tenant |
| `INGESTION_QUOTA_TTL` | `3600` | Quota safety valve TTL in seconds |
| `INGESTION_PARSE_TIMEOUT` | `600` | Parse activity timeout in seconds |
| `INGESTION_CHUNK_TIMEOUT` | `120` | Chunk activity timeout in seconds |
| `INGESTION_EMBED_TIMEOUT` | `300` | Embed activity timeout in seconds |
| `INGESTION_OCR_ENABLED` | `true` | Enable OCR for PDFs and images |
| `INGESTION_MAX_DOC_SIZE` | `52428800` | Maximum document size in bytes (50 MB) |
| `INGESTION_MAX_ACTIVITIES` | `2` | Max concurrent activities per worker |
| `INGESTION_VLM_ENABLED` | `false` | Enable VLM drawing descriptions |
| `INGESTION_VLM_MODEL` | `gemini/gemini-3-flash` | VLM model (any LiteLLM-compatible) |
| `INGESTION_VLM_MAX_TOKENS` | `8192` | Max tokens for VLM responses |
| `INGESTION_MAX_DRAWING_PAGES` | `100` | Max drawing pages to render per document |
| `INGESTION_VISION_RETRIEVAL_ENABLED` | `false` | Enable ColPali visual indexing |
| `COLIVARA_API_KEY` | (required if vision enabled) | ColiVara API key |
| `COLIVARA_BASE_URL` | `https://api.colivara.com` | ColiVara API endpoint |
| `INGESTION_PAGE_IMAGE_DPI` | `144` | Page rendering DPI (144 = 2x scale) |
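For intuition on the `INGESTION_CHUNK_SIZE`/`INGESTION_CHUNK_OVERLAP` pair, here is a dependency-free chunking sketch that uses words as a stand-in for tokens (Martha sizes real chunks with the `cl100k_base` tokenizer):

```python
# Sliding-window chunking: each chunk holds `size` tokens and repeats the
# last `overlap` tokens of its predecessor, so boundary context is never
# split across a hard edge.

def chunk(tokens: list[str], size: int = 500, overlap: int = 50) -> list[list[str]]:
    step = size - overlap  # each chunk starts `step` tokens after the last
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

words = [f"w{i}" for i in range(1200)]
chunks = chunk(words, size=500, overlap=50)
```

With the defaults, a 1200-token document yields chunks starting at tokens 0, 450, and 900, each sharing 50 tokens with its neighbor.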

## Running the Ingestion Worker

The ingestion worker runs as a separate process from the main API:

```bash
python -m ingestion.worker
```

In Docker, it runs as its own container using `Dockerfile.ingestion`, which includes the Docling and PyTorch dependencies that the main API does not need.

```yaml
# docker-compose snippet
ingestion-worker:
  build:
    dockerfile: Dockerfile.ingestion
  environment:
    - TEMPORAL_ADDRESS=temporal:7233
    - REDIS_URL=redis://redis:6379
    - DATABASE_URL=postgresql://...
  depends_on:
    - temporal
    - redis
    - postgres
```

Martha is built by aiaiai-pt.