Document Ingestion
When a document is uploaded to a collection, Martha automatically parses it, splits it into searchable chunks, and generates vector embeddings. This process runs as a durable Temporal workflow on a dedicated task queue, isolated from the main API.
How It Works
Upload → Validate → Parse+Chunk → Enrich (parallel) → Finalize
         │          │             │                   │
         │          │             ├─ Embed            └─ Mark "ready"
         │          │             ├─ VLM Describe
         │          │             └─ ColPali Index
         │          └─ Docling parse + HybridChunker + page classification
         └─ Size, type, and tenant checks
Each stage is a separate Temporal activity with its own retry policy and timeout. The enrich stage runs three activities in parallel; all are non-fatal, so failures degrade gracefully rather than blocking the pipeline.
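The "parallel and non-fatal" behaviour of the enrich stage can be illustrated with a small sketch. This uses plain asyncio rather than the Temporal SDK, and the three step functions are hypothetical stand-ins (one is made to fail on purpose); it only shows the pattern of running steps concurrently and recording failures instead of raising them.

```python
import asyncio


async def embed(chunks):
    # Hypothetical stand-in for the embedding activity.
    return f"embedded {len(chunks)} chunks"


async def vlm_describe(chunks):
    # Simulated failure, to show that one failing step does not stop the others.
    raise RuntimeError("VLM provider unavailable")


async def colpali_index(chunks):
    # Hypothetical stand-in for the visual indexing activity.
    return "indexed pages"


async def enrich(chunks):
    """Run the three enrichment steps concurrently and collect failures."""
    steps = {
        "embed": embed,
        "vlm_describe": vlm_describe,
        "colpali_index": colpali_index,
    }
    # return_exceptions=True returns exceptions as results instead of raising.
    results = await asyncio.gather(
        *(step(chunks) for step in steps.values()),
        return_exceptions=True,
    )
    outcome = {}
    for name, result in zip(steps, results):
        if isinstance(result, Exception):
            outcome[name] = {"ok": False, "error": str(result)}
        else:
            outcome[name] = {"ok": True, "value": result}
    return outcome


if __name__ == "__main__":
    print(asyncio.run(enrich(["chunk a", "chunk b"])))
```

In this sketch the caller always receives an outcome per step, so a finalize step could proceed regardless of which enrichments succeeded.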
Page Rendering and Visual Processing
For PDFs, every page is rendered to PNG and uploaded to storage (up to MAX_RENDER_PAGES, default 300). This enables:
- get_page_image and visual_search tools to return page images to agents
- Visual indexing via ColPali (ColiVara SDK) for image-based retrieval across all pages
Additionally, pages are classified as text, drawing, or table based on element bounding boxes. Drawing pages get extra treatment:
- VLM descriptions generated by Gemini 3 Flash, stored as searchable drawing_description chunks
All enrichment activities (embedding, VLM, vision indexing) are optional and non-fatal — if any fails, text search still works.
Supported Formats
| Format | Content Type | Notes |
|---|---|---|
| PDF | application/pdf | Full text extraction with optional OCR |
| DOCX | application/vnd.openxmlformats-officedocument.wordprocessingml.document | Word documents |
| PPTX | application/vnd.openxmlformats-officedocument.presentationml.presentation | PowerPoint |
| HTML | text/html | Web pages |
| Markdown | text/markdown | Markdown files |
| CSV | text/csv | Tabular data |
| Images | image/png, image/jpeg, image/tiff, image/webp, image/bmp | OCR text extraction |
!!! info "Maximum file size"
    The default maximum is 50 MB per document. This can be adjusted via the INGESTION_MAX_DOC_SIZE environment variable.
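As a rough illustration of the size and type checks, the sketch below validates an upload against the content types in the table and the 50 MB default. The function name, the returned error strings, and the choice to accept a file of exactly the maximum size are assumptions, not Martha's actual validation code; tenant checks are omitted.

```python
# Content types taken from the Supported Formats table.
SUPPORTED_CONTENT_TYPES = {
    "application/pdf",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    "text/html",
    "text/markdown",
    "text/csv",
    "image/png",
    "image/jpeg",
    "image/tiff",
    "image/webp",
    "image/bmp",
}

DEFAULT_MAX_DOC_SIZE = 50 * 1024 * 1024  # 52428800 bytes (50 MB)


def validate_upload(content_type, size_bytes, max_size=DEFAULT_MAX_DOC_SIZE):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    if content_type not in SUPPORTED_CONTENT_TYPES:
        problems.append(f"unsupported content type: {content_type}")
    # Assumption: a file of exactly max_size bytes is still accepted.
    if size_bytes > max_size:
        problems.append(f"file too large: {size_bytes} > {max_size} bytes")
    return problems
```

For example, `validate_upload("application/pdf", 1024)` returns an empty list, while an oversized ZIP archive would yield two problems.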
Graceful Degradation
The enrich stage (embedding, VLM descriptions, visual indexing) is not required for a document to be usable. If any enrichment fails:
- The document is still marked as "ready"
- Chunks are stored with full text (keyword search works)
- Semantic search is unavailable if embeddings failed
- Drawing descriptions are unavailable if VLM failed
- Visual retrieval is unavailable if ColPali indexing failed
- All can be backfilled by re-ingesting
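The list above amounts to a simple mapping from enrichment outcomes to available features. A minimal sketch, with hypothetical feature names chosen only for illustration:

```python
def available_features(embed_ok, vlm_ok, colpali_ok):
    """Map enrichment outcomes to the features a document can serve."""
    # Keyword search never depends on enrichment: chunks keep their full text.
    features = {"keyword_search"}
    if embed_ok:
        features.add("semantic_search")
    if vlm_ok:
        features.add("drawing_descriptions")
    if colpali_ok:
        features.add("visual_retrieval")
    return features
```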
Monitoring Ingestion
Admin UI
The Documents page shows ingestion status for each document:
- Gray badge — Pending (not yet started)
- Blue badge with spinner — Ingesting (workflow running)
- Green badge — Ready (ingestion complete)
- Red badge — Error (with details)
When any document is actively ingesting, the page auto-refreshes every 3 seconds. Click on a document to see detailed progress: current stage, percentage, chunk count, and any errors.
API
Poll the ingestion status endpoint for programmatic monitoring:
GET /api/admin/documents/{document_id}/ingestion-status?tenant_id=your-tenant
Response:
{
  "document_id": "a1b2c3d4-...",
  "ingestion_status": "ingesting",
  "stage": "embed",
  "progress_pct": 65,
  "chunk_count": 42,
  "revision_id": "e5f6g7h8-...",
  "error": null,
  "started_at": "2026-02-17T10:30:00Z",
  "completed_at": null
}
Stages progress through: validate → create_revision → parse_and_chunk → enrich → finalize → done.
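A polling client for this endpoint might look like the following sketch. It assumes that "ready" and "error" are the terminal values of ingestion_status (inferred from the Admin UI badges, not confirmed), that no authentication header is needed, and that a 3-second interval is acceptable; the helper names are hypothetical. The fetch function is injected so the loop can be exercised without a server.

```python
import json
import time
import urllib.parse
import urllib.request

# Assumption: these are the terminal ingestion_status values.
TERMINAL_STATUSES = {"ready", "error"}


def fetch_status(base_url, document_id, tenant_id):
    """Call the ingestion status endpoint once and decode the JSON body."""
    query = urllib.parse.urlencode({"tenant_id": tenant_id})
    url = f"{base_url}/api/admin/documents/{document_id}/ingestion-status?{query}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def wait_for_ingestion(fetch, interval_s=3.0, max_polls=100, sleep=time.sleep):
    """Poll fetch() until a terminal status appears or the budget runs out."""
    for _ in range(max_polls):
        status = fetch()
        if status["ingestion_status"] in TERMINAL_STATUSES:
            return status
        sleep(interval_s)
    raise TimeoutError("ingestion did not finish within the polling budget")
```

A caller would typically wrap the real request, for example `wait_for_ingestion(lambda: fetch_status(base_url, doc_id, "your-tenant"))`, and then inspect the `error` field of the returned status.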
Re-Ingestion
To re-process a document (e.g., after changing chunking settings or to backfill embeddings):
POST /api/admin/documents/{document_id}/reingest?tenant_id=your-tenant
This creates a new revision; the previous revision and its chunks are preserved. The read_doc and search_docs tools always use the latest successful revision.
!!! warning "Concurrent re-ingestion"
    A document that is already being ingested cannot be re-ingested simultaneously. The API returns 409 Conflict until the current workflow completes.
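A client calling the re-ingest endpoint has to distinguish at least three outcomes: accepted, already ingesting (409), and tenant quota exceeded (429). The sketch below shows one way to do that with the standard library. The outcome labels are hypothetical, any 2xx response is assumed to mean success, and authentication is omitted.

```python
import urllib.error
import urllib.parse
import urllib.request


def interpret_reingest_status(status_code):
    """Translate an HTTP status code into a coarse outcome label."""
    if 200 <= status_code < 300:
        return "accepted"  # assumption: any 2xx means the workflow was started
    if status_code == 409:
        return "already_ingesting"
    if status_code == 429:
        return "tenant_quota_exceeded"
    return "failed"


def request_reingest(base_url, document_id, tenant_id):
    """POST to the re-ingest endpoint and return the outcome label."""
    query = urllib.parse.urlencode({"tenant_id": tenant_id})
    url = f"{base_url}/api/admin/documents/{document_id}/reingest?{query}"
    req = urllib.request.Request(url, method="POST")
    try:
        with urllib.request.urlopen(req) as resp:
            return interpret_reingest_status(resp.status)
    except urllib.error.HTTPError as exc:
        # urllib raises for 4xx/5xx responses; the code is still available.
        return interpret_reingest_status(exc.code)
```

On "already_ingesting" a caller would normally wait for the running workflow to finish rather than retry immediately.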
Revisions
Each ingestion run creates an immutable revision:
- Stores the raw parsed text and structured metadata (pages, sections)
- Preserves a content hash for deduplication
- Old revisions and their chunks remain in the database
- read_doc always serves from the latest successful revision (no S3 download needed)
- Revision number auto-increments per document
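The revision rules can be modelled with a small immutable record. This is an illustrative sketch only: the field names are hypothetical, and SHA-256 is an assumed choice for the content hash, which the text does not specify.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)  # frozen: a revision cannot be modified after creation
class Revision:
    number: int
    content_hash: str
    succeeded: bool


def content_hash(raw_text):
    # Assumption: SHA-256 over the UTF-8 encoded parsed text.
    return hashlib.sha256(raw_text.encode("utf-8")).hexdigest()


def next_revision_number(revisions: List[Revision]) -> int:
    """Revision numbers auto-increment per document, starting at 1."""
    return max((r.number for r in revisions), default=0) + 1


def latest_successful(revisions: List[Revision]) -> Optional[Revision]:
    """Pick the revision that read tools would serve from, if any."""
    successful = [r for r in revisions if r.succeeded]
    return max(successful, key=lambda r: r.number) if successful else None
```

Note that a failed latest run does not hide earlier content: `latest_successful` skips it and falls back to the most recent revision that succeeded.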
Backpressure
To prevent a single tenant from overwhelming the ingestion pipeline, Martha enforces a per-tenant concurrency limit (default: 5 concurrent workflows).
When a tenant hits the limit, upload and re-ingest requests return 429 Too Many Requests. The quota is released automatically when a workflow completes (success or failure) or after a TTL safety valve expires (default: 1 hour).
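The limit-plus-TTL behaviour described above can be sketched with an in-memory tracker. This is not Martha's implementation (where the quota state actually lives is not stated here); the class and method names are hypothetical, and the clock is injectable so expiry can be demonstrated without waiting.

```python
import time


class TenantQuota:
    """Per-tenant concurrency limit with a TTL safety valve (in-memory sketch)."""

    def __init__(self, max_concurrent=5, ttl_s=3600, clock=time.monotonic):
        self.max_concurrent = max_concurrent
        self.ttl_s = ttl_s
        self.clock = clock
        self._slots = {}  # tenant_id -> {workflow_id: expiry timestamp}

    def _live_slots(self, tenant_id):
        # Drop slots whose TTL has passed, e.g. a workflow that never released.
        now = self.clock()
        slots = self._slots.get(tenant_id, {})
        live = {wf: exp for wf, exp in slots.items() if exp > now}
        self._slots[tenant_id] = live
        return live

    def acquire(self, tenant_id, workflow_id):
        """Return True if a slot was taken; False means respond with 429."""
        slots = self._live_slots(tenant_id)
        if len(slots) >= self.max_concurrent:
            return False
        slots[workflow_id] = self.clock() + self.ttl_s
        return True

    def release(self, tenant_id, workflow_id):
        """Free the slot when a workflow completes, on success or failure."""
        self._slots.get(tenant_id, {}).pop(workflow_id, None)
```

The TTL matters only when `release` is never called; in the normal path the slot is freed as soon as the workflow ends.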
Configuration
All ingestion settings can be overridden via environment variables:
| Variable | Default | Description |
|---|---|---|
| INGESTION_CHUNK_SIZE | 500 | Target tokens per chunk |
| INGESTION_CHUNK_OVERLAP | 50 | Token overlap between chunks (~10%) |
| INGESTION_TOKENIZER | cl100k_base | Tokenizer model for chunk sizing |
| INGESTION_EMBEDDING_MODEL | text-embedding-3-small | Embedding model (any LiteLLM-compatible) |
| INGESTION_EMBEDDING_DIMS | 1536 | Expected embedding dimensions |
| INGESTION_EMBEDDING_BATCH | 100 | Chunks per embedding API call |
| INGESTION_EMBEDDING_RETRIES | 3 | Max retries for embedding failures |
| INGESTION_MAX_CONCURRENT | 5 | Max concurrent workflows per tenant |
| INGESTION_QUOTA_TTL | 3600 | Quota safety valve TTL in seconds |
| INGESTION_PARSE_TIMEOUT | 600 | Parse activity timeout in seconds |
| INGESTION_CHUNK_TIMEOUT | 120 | Chunk activity timeout in seconds |
| INGESTION_EMBED_TIMEOUT | 300 | Embed activity timeout in seconds |
| INGESTION_OCR_ENABLED | true | Enable OCR for PDFs and images |
| INGESTION_MAX_DOC_SIZE | 52428800 | Maximum document size in bytes (50 MB) |
| INGESTION_MAX_ACTIVITIES | 2 | Max concurrent activities per worker |
| INGESTION_VLM_ENABLED | false | Enable VLM drawing descriptions |
| INGESTION_VLM_MODEL | gemini/gemini-3-flash | VLM model (any LiteLLM-compatible) |
| INGESTION_VLM_MAX_TOKENS | 8192 | Max tokens for VLM responses |
| INGESTION_MAX_DRAWING_PAGES | 100 | Max drawing pages to render per document |
| INGESTION_VISION_RETRIEVAL_ENABLED | false | Enable ColPali visual indexing |
| COLIVARA_API_KEY | (required if vision enabled) | ColiVara API key |
| COLIVARA_BASE_URL | https://api.colivara.com | ColiVara API endpoint |
| INGESTION_PAGE_IMAGE_DPI | 144 | Page rendering DPI (144 = 2x scale) |
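To show how such overrides typically resolve, here is a sketch that reads a handful of the variables with their documented defaults. The dataclass, the loader, and the set of strings accepted as "true" are assumptions for illustration; the actual settings code may parse values differently.

```python
import os
from dataclasses import dataclass


def _env_bool(environ, name, default):
    raw = environ.get(name)
    if raw is None:
        return default
    # Assumption: these spellings count as true; anything else is false.
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class IngestionSettings:
    chunk_size: int
    chunk_overlap: int
    max_concurrent: int
    max_doc_size: int
    ocr_enabled: bool
    vlm_enabled: bool


def load_settings(environ=os.environ):
    """Build settings from an environment mapping, using documented defaults."""
    return IngestionSettings(
        chunk_size=int(environ.get("INGESTION_CHUNK_SIZE", "500")),
        chunk_overlap=int(environ.get("INGESTION_CHUNK_OVERLAP", "50")),
        max_concurrent=int(environ.get("INGESTION_MAX_CONCURRENT", "5")),
        max_doc_size=int(environ.get("INGESTION_MAX_DOC_SIZE", "52428800")),
        ocr_enabled=_env_bool(environ, "INGESTION_OCR_ENABLED", True),
        vlm_enabled=_env_bool(environ, "INGESTION_VLM_ENABLED", False),
    )
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise, for example `load_settings({"INGESTION_CHUNK_SIZE": "800"})`.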
Running the Ingestion Worker
The ingestion worker runs as a separate process from the main API:
python -m ingestion.worker
In Docker, it runs as its own container using Dockerfile.ingestion, which includes the Docling and PyTorch dependencies that the main API does not need.
# docker-compose snippet
ingestion-worker:
  build:
    dockerfile: Dockerfile.ingestion
  environment:
    - TEMPORAL_ADDRESS=temporal:7233
    - REDIS_URL=redis://redis:6379
    - DATABASE_URL=postgresql://...
  depends_on:
    - temporal
    - redis
    - postgres