Skip to content

R2 Folder Sync

Automatically ingest documents uploaded to Cloudflare R2 buckets. Users upload structured folder hierarchies via any S3-compatible tool (rclone, AWS CLI, R2 console), and Martha detects changes, maps folders to collections, and triggers the ingestion pipeline.

How It Works

S3 client uploads to R2
    -> R2 event notification -> Cloudflare Queue
    -> Martha polls queue (every 5s)
    -> Creates collection + document records
    -> Triggers DocumentIngestionWorkflow
    -> Document is parsed, chunked, embedded

A daily reconciliation workflow catches any events missed due to downtime.

R2 Key Convention

Files must follow this path structure:

{tenant_id}/{collection_slug}/{filename}
SegmentMaps toExample
tenant_idData isolation boundaryacme-corp
collection_slugDocumentCollection (auto-created if needed)site-surveys
filenameDocument recordreport.pdf

Deeper paths are flattened into the filename:

acme-corp/site-surveys/photos/tower-north.jpg
-> tenant: acme-corp
-> collection: site-surveys
-> filename: photos/tower-north.jpg

Configuration

SettingEnv VarDefaultDescription
Enable syncR2_SYNC_ENABLEDfalseMaster toggle
Account IDCF_ACCOUNT_IDrequiredCloudflare account ID
Queue IDCF_QUEUE_IDrequiredCloudflare Queue ID
API TokenCF_API_TOKENrequiredToken with queues_read + queues_write
Poll intervalR2_SYNC_POLL_INTERVAL_SECONDS5Seconds between queue polls
Batch sizeR2_SYNC_BATCH_SIZE100Messages per poll (max 100)
Auto-create collectionsR2_SYNC_AUTO_CREATE_COLLECTIONStrueCreate collections from folder names
Reconciliation cronR2_SYNC_RECONCILIATION_CRON0 3 * * *Daily reconciliation schedule
Visibility timeoutR2_SYNC_VISIBILITY_TIMEOUT_MS300000Message lock duration (5 min)

Cloudflare Setup (One-Time)

bash
# Create the queue
wrangler queues create martha-r2-events

# Enable HTTP pull consumer
wrangler queues consumer http add martha-r2-events

# Add notification rules on your R2 bucket
wrangler r2 bucket notification create <bucket-name> \
  --queue=martha-r2-events --event-type=object-create

wrangler r2 bucket notification create <bucket-name> \
  --queue=martha-r2-events --event-type=object-delete

Upload Methods

MethodUse Case
AWS CLI / rclone / s3cmdBulk uploads from ops teams
R2 console (dashboard)Ad-hoc uploads
Presigned URLs from Martha APIWeb/CLI uploads (future)
Super SlurperInitial migration from S3/MinIO

!!! warning "wrangler uploads don't trigger notifications" wrangler r2 object put uses an internal API that does not fire event notifications. Always use S3-compatible tools (AWS CLI, rclone, boto3) for uploads that should trigger sync.

Admin API

Trigger Reconciliation

http
POST /api/admin/r2-sync/reconcile
Content-Type: application/json
Authorization: Bearer <jwt>

{
  "tenant_id": "acme-corp"
}

Starts R2ReconciliationWorkflow which scans R2 objects vs DB records for the given tenant, processing missed creates, changed files, and orphaned records.

Change Detection

  • New files: eTag recorded, ingestion triggered
  • Changed files: different eTag detected, re-ingestion triggered
  • Deleted files: document soft-deleted (is_active=false), chunks preserved for citation stability
  • Duplicate notifications: same eTag = skip (idempotent)

Backpressure

Large batch uploads (e.g., 1,000 files via rclone) are naturally throttled by the existing per-tenant ingestion quota (MAX_CONCURRENT_PER_TENANT=5). Documents queue in Temporal and process in order.

Martha is built by aiaiai-pt.