
# Document Ingestion

When a document is uploaded to a collection, Martha automatically parses it, splits it into searchable chunks, and generates vector embeddings. This process runs as a durable Temporal workflow on a dedicated task queue, isolated from the main API.

## How It Works

```
Upload → Validate → Parse+Chunk → Enrich (parallel) → Finalize
           │            │               │                  │
           │            │               ├─ Embed           └─ Mark "ready"
           │            │               ├─ VLM Describe
           │            │               └─ ColPali Index
           │            └─ Docling parse + HybridChunker + page classification
           └─ Size, type, and tenant checks
```

Each stage is a separate Temporal activity with its own retry policy and timeout. The enrich stage runs three activities in parallel — all are non-fatal, so failures degrade gracefully rather than blocking the pipeline.
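The fan-out/fan-in shape of the enrich stage can be sketched with plain `asyncio`. The real pipeline uses Temporal activities; the task names below are illustrative, not Martha's actual API:

```python
import asyncio

# Three enrichment sub-tasks run in parallel; a failure in any one of
# them is recorded rather than raised, so the pipeline can still finalize.

async def embed_chunks() -> str:
    return "embedded"

async def describe_drawings() -> str:
    raise RuntimeError("VLM unavailable")  # simulate a non-fatal failure

async def index_pages() -> str:
    return "indexed"

async def enrich() -> dict:
    results = await asyncio.gather(
        embed_chunks(), describe_drawings(), index_pages(),
        return_exceptions=True,  # collect failures instead of cancelling siblings
    )
    names = ["embed", "vlm_describe", "colpali_index"]
    # Failed sub-tasks become warnings; everything else proceeds normally.
    return {n: ("ok" if not isinstance(r, Exception) else f"skipped: {r}")
            for n, r in zip(names, results)}

status = asyncio.run(enrich())
```

With `return_exceptions=True`, one failing task does not cancel its siblings, which mirrors the non-fatal behavior described above.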

## Page Rendering and Visual Processing

For PDFs, every page is rendered to PNG and uploaded to storage (up to `MAX_RENDER_PAGES`, default 300). This enables:

- `get_page_image` and `visual_search` tools to return page images to agents
- Visual indexing via ColPali (ColiVara SDK) for image-based retrieval across all pages

Additionally, pages are classified as text, drawing, or table based on element bounding boxes. Drawing pages get extra treatment:

- VLM descriptions generated by Gemini 3 Flash, stored as searchable `drawing_description` chunks
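One plausible bounding-box heuristic for the page classification described above. The element types and the 0.5 coverage threshold are illustrative assumptions, not Martha's actual rules:

```python
# Classify a page from its parsed elements' bounding boxes.
# elements: [{"type": "text"|"picture"|"table", "bbox": (x0, y0, x1, y1)}]

def classify_page(elements: list[dict], page_area: float) -> str:
    def area(bbox):
        x0, y0, x1, y1 = bbox
        return max(0.0, x1 - x0) * max(0.0, y1 - y0)

    covered = {"text": 0.0, "picture": 0.0, "table": 0.0}
    for el in elements:
        covered[el["type"]] = covered.get(el["type"], 0.0) + area(el["bbox"])

    # A page dominated by picture area is treated as a drawing, a page
    # dominated by table area as a table, and everything else as text.
    if covered["picture"] > 0.5 * page_area:
        return "drawing"
    if covered["table"] > 0.5 * page_area:
        return "table"
    return "text"
```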

All enrichment activities (embedding, VLM, vision indexing) are optional and non-fatal — if any fails, text search still works.

## Supported Formats

| Format | Content Type | Notes |
|---|---|---|
| PDF | `application/pdf` | Full text extraction with optional OCR |
| DOCX | `application/vnd.openxmlformats-officedocument.wordprocessingml.document` | Word documents |
| PPTX | `application/vnd.openxmlformats-officedocument.presentationml.presentation` | PowerPoint presentations |
| HTML | `text/html` | Web pages |
| Markdown | `text/markdown` | Markdown files |
| CSV | `text/csv` | Tabular data |
| Images | `image/png`, `image/jpeg`, `image/tiff`, `image/webp`, `image/bmp` | OCR text extraction |

!!! info "Maximum file size"
    The default maximum is 50 MB per document. This can be adjusted via the `INGESTION_MAX_DOC_SIZE` environment variable.

## Graceful Degradation

The enrich stage (embedding, VLM descriptions, visual indexing) is not required for a document to be usable. If any enrichment fails:

- The document is still marked as "ready"
- Chunks are stored with full text, so keyword search works
- Semantic search is unavailable if embedding failed
- Drawing descriptions are unavailable if the VLM failed
- Visual retrieval is unavailable if ColPali indexing failed
- All of these can be backfilled by re-ingesting the document

## Monitoring Ingestion

### Admin UI

The Documents page shows ingestion status for each document:

- **Gray badge** — Pending (not yet started)
- **Blue badge with spinner** — Ingesting (workflow running)
- **Green badge** — Ready (ingestion complete)
- **Red badge** — Error (with details)

When any document is actively ingesting, the page auto-refreshes every 3 seconds. Click on a document to see detailed progress: current stage, percentage, chunk count, and any errors.

### API

Poll the ingestion status endpoint for programmatic monitoring:

```bash
GET /api/admin/documents/{document_id}/ingestion-status?tenant_id=your-tenant
```

Response:

```json
{
  "document_id": "a1b2c3d4-...",
  "ingestion_status": "ingesting",
  "stage": "embed",
  "progress_pct": 65,
  "chunk_count": 42,
  "revision_id": "e5f6g7h8-...",
  "error": null,
  "started_at": "2026-02-17T10:30:00Z",
  "completed_at": null
}
```

Stages progress through: `validate` → `create_revision` → `parse_and_chunk` → `enrich` → `finalize` → `done`.
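A polling loop for this endpoint might look like the following sketch. The base URL is a placeholder, and the terminal statuses (`ready`, `error`) are inferred from the status badges described above:

```python
import json
import time
import urllib.request

BASE = "http://localhost:8000"  # placeholder; adapt to your deployment

def is_terminal(status: str) -> bool:
    # "ready" and "error" end the workflow; "pending"/"ingesting" do not.
    return status in ("ready", "error")

def wait_for_ingestion(document_id: str, tenant_id: str, interval: float = 3.0) -> dict:
    """Poll the ingestion-status endpoint until the document settles."""
    url = (f"{BASE}/api/admin/documents/{document_id}"
           f"/ingestion-status?tenant_id={tenant_id}")
    while True:
        with urllib.request.urlopen(url) as resp:
            body = json.load(resp)
        if is_terminal(body["ingestion_status"]):
            return body
        time.sleep(interval)
```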


## Re-Ingestion

To re-process a document (e.g., after changing chunking settings or to backfill embeddings):

```bash
POST /api/admin/documents/{document_id}/reingest?tenant_id=your-tenant
```

This creates a new revision — the previous revision and its chunks are preserved. The `read_doc` and `search_docs` tools always use the latest successful revision.

!!! warning "Concurrent re-ingestion"
    A document that is already being ingested cannot be re-ingested simultaneously. The API returns `409 Conflict` until the current workflow completes.


## Revisions

Each ingestion run creates an immutable revision:

- Stores the raw parsed text and structured metadata (pages, sections)
- Preserves a content hash for deduplication
- Old revisions and their chunks remain in the database
- `read_doc` always serves from the latest successful revision (no S3 download needed)
- The revision number auto-increments per document

## Backpressure

To prevent a single tenant from overwhelming the ingestion pipeline, Martha enforces a per-tenant concurrency limit (default: 5 concurrent workflows).

When a tenant hits the limit, upload and re-ingest requests return `429 Too Many Requests`. The quota is released automatically when a workflow completes (success or failure) or after a TTL safety valve expires (default: 1 hour).
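The quota semantics can be modeled in-process. The class below, with hypothetical method names, only illustrates the limit-plus-TTL behavior; it is not Martha's production implementation:

```python
import time

class IngestionQuota:
    """Per-tenant concurrency limit with a TTL safety valve."""

    def __init__(self, limit: int = 5, ttl: float = 3600.0):
        self.limit, self.ttl = limit, ttl
        self._slots: dict[str, list[float]] = {}  # tenant -> acquisition times

    def try_acquire(self, tenant: str) -> bool:
        now = time.monotonic()
        # Slots older than the TTL are treated as leaked and reclaimed.
        live = [t for t in self._slots.get(tenant, []) if now - t < self.ttl]
        if len(live) >= self.limit:
            self._slots[tenant] = live
            return False  # caller should respond 429 Too Many Requests
        self._slots[tenant] = live + [now]
        return True

    def release(self, tenant: str) -> None:
        # Called on workflow completion, success or failure alike.
        slots = self._slots.get(tenant, [])
        if slots:
            slots.pop(0)

q = IngestionQuota(limit=2)
```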


## Configuration

All ingestion settings can be overridden via environment variables:

| Variable | Default | Description |
|---|---|---|
| `INGESTION_CHUNK_SIZE` | `500` | Target tokens per chunk |
| `INGESTION_CHUNK_OVERLAP` | `50` | Token overlap between chunks (~10%) |
| `INGESTION_TOKENIZER` | `cl100k_base` | Tokenizer model for chunk sizing |
| `INGESTION_EMBEDDING_MODEL` | `text-embedding-3-small` | Embedding model (any LiteLLM-compatible) |
| `INGESTION_EMBEDDING_DIMS` | `1536` | Expected embedding dimensions |
| `INGESTION_EMBEDDING_BATCH` | `100` | Chunks per embedding API call |
| `INGESTION_EMBEDDING_RETRIES` | `3` | Max retries for embedding failures |
| `INGESTION_MAX_CONCURRENT` | `5` | Max concurrent workflows per tenant |
| `INGESTION_QUOTA_TTL` | `3600` | Quota safety valve TTL in seconds |
| `INGESTION_PARSE_TIMEOUT` | `600` | Parse activity timeout in seconds |
| `INGESTION_CHUNK_TIMEOUT` | `120` | Chunk activity timeout in seconds |
| `INGESTION_EMBED_TIMEOUT` | `300` | Embed activity timeout in seconds |
| `INGESTION_OCR_ENABLED` | `true` | Enable OCR for PDFs and images |
| `INGESTION_MAX_DOC_SIZE` | `52428800` | Maximum document size in bytes (50 MB) |
| `INGESTION_MAX_ACTIVITIES` | `2` | Max concurrent activities per worker |
| `INGESTION_VLM_ENABLED` | `false` | Enable VLM drawing descriptions |
| `INGESTION_VLM_MODEL` | `gemini/gemini-3-flash` | VLM model (any LiteLLM-compatible) |
| `INGESTION_VLM_MAX_TOKENS` | `8192` | Max tokens for VLM responses |
| `INGESTION_MAX_DRAWING_PAGES` | `100` | Max drawing pages to render per document |
| `INGESTION_VISION_RETRIEVAL_ENABLED` | `false` | Enable ColPali visual indexing |
| `COLIVARA_API_KEY` | (required if vision enabled) | ColiVara API key |
| `COLIVARA_BASE_URL` | `https://api.colivara.com` | ColiVara API endpoint |
| `INGESTION_PAGE_IMAGE_DPI` | `144` | Page rendering DPI (144 = 2x scale) |
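For intuition on the `INGESTION_CHUNK_SIZE`/`INGESTION_CHUNK_OVERLAP` pair, here is a dependency-free chunking sketch that uses words as a stand-in for tokens (Martha sizes real chunks with the `cl100k_base` tokenizer):

```python
# Sliding-window chunking: each chunk holds `size` tokens and repeats the
# last `overlap` tokens of its predecessor, so boundary context is never
# split across a hard edge.

def chunk(tokens: list[str], size: int = 500, overlap: int = 50) -> list[list[str]]:
    step = size - overlap  # each chunk starts `step` tokens after the last
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

words = [f"w{i}" for i in range(1200)]
chunks = chunk(words, size=500, overlap=50)
```

With the defaults, a 1200-token document yields chunks starting at tokens 0, 450, and 900, each sharing 50 tokens with its neighbor.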

## Running the Ingestion Worker

The ingestion worker runs as a separate process from the main API:

```bash
python -m ingestion.worker
```

In Docker, it runs as its own container using `Dockerfile.ingestion`, which includes the Docling and PyTorch dependencies that the main API does not need.

```yaml
# docker-compose snippet
ingestion-worker:
  build:
    dockerfile: Dockerfile.ingestion
  environment:
    - TEMPORAL_ADDRESS=temporal:7233
    - REDIS_URL=redis://redis:6379
    - DATABASE_URL=postgresql://...
  depends_on:
    - temporal
    - redis
    - postgres
```

Martha is built by aiaiai-pt.