# Agent memory runbook — supersede lifecycle (C2-3b)

Operator playbook for the contradiction-handling subsystem introduced in #240. Covers detection, intervention, and tuning. Pair with the user-facing guide at docs/guides/agent-memory.md.
## What runs where
| Surface | Component | Lives in |
|---|---|---|
| Agent write path | `remember_fact` platform fn | `core/memory_remember_tools.py` |
| Async judge | `judge_memory_items_activity` | `temporal/activities.py` |
| Dispatcher | `MemoryEmbedDispatcherWorkflow` | `temporal/memory_embed_workflow.py` |
| Judge prompt + parser | `run_judge` | `core/memory_judge.py` |
| Candidate lookup | `_find_supersede_candidates` | `core/memory_index/search.py` |
| Metrics | counters + histograms | `core/memory_metrics.py` |
| Audit log writer | inline INSERT `ExecutionAuditLog` | `temporal/activities.py` |
## Healthy steady state
| Signal | Mimir query | Target |
|---|---|---|
| Judge throughput | `sum(rate(martha_memory_judge_total[5m]))` | > 0 when fact writes occur |
| Supersede rate | `sum by (action) (rate(martha_memory_judge_total{action="supersede"}[5m]))` | proportional to contradiction frequency |
| Fail-open rate | `sum(rate(martha_memory_judge_total{action="judge_fail_open"}[5m]))` | ≤ 5% of total |
| Lag p95 | `histogram_quantile(0.95, sum by (le) (rate(martha_memory_judge_lag_seconds_bucket[5m])))` | ≤ 30s on a single worker |
| Latency p95 | `histogram_quantile(0.95, sum by (le) (rate(martha_memory_judge_latency_seconds_bucket[5m])))` | ≤ 2s for Haiku-class judge |
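To spot-check the fail-open target without a dashboard, divide the two counter rates. A trivial sketch (`fail_open_ratio` is a hypothetical helper; the sample rates would come from the queries above):

```python
def fail_open_ratio(fail_open_rate: float, total_rate: float) -> float:
    """Fraction of judge outcomes that failed open (0.0 when the judge is idle)."""
    if total_rate == 0:
        return 0.0
    return fail_open_rate / total_rate

# Example: 0.02 fail-opens/s against 1.0 judged/s is 2%, under the 5% target.
print(fail_open_ratio(0.02, 1.0))
```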
## Common alerts

### A1 — sustained fail-open rate > 5%
**Likely cause.** Judge model unreachable, rate-limited, or returning malformed output.

**Triage.**
- Pull recent worker logs: `docker logs martha-worker --tail 200 | grep memory_judge.fail_open`. The structured WARN line carries `reason`, `model`, and `error_type` — never content.
- If `reason=api_error` and `error_type=AuthenticationError` → expired API key. Rotate it.
- If `reason=timeout` → transient model overload. Watch — recovers when upstream does.
- If `reason=parse_error` consistently → the judge is returning non-JSON. Check that the model id is the expected one (typo in env var, accidental rollout to a different model).
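For context, the `reason` values above map to distinct failure paths around the judge call. A minimal sketch of the fail-open pattern (the function names and the exact exception mapping are illustrative assumptions, not the real core/memory_judge.py code):

```python
import json

def judge_with_fail_open(call_judge, prompt: str) -> dict:
    """Run the judge; on any failure, fall back to KEEP and record why."""
    try:
        raw = call_judge(prompt)  # may raise on network/auth/timeout
        return {"action": json.loads(raw)["action"]}
    except TimeoutError:
        reason, error_type = "timeout", "TimeoutError"
    except (json.JSONDecodeError, KeyError) as exc:
        reason, error_type = "parse_error", type(exc).__name__
    except Exception as exc:  # auth errors, rate limits, etc.
        reason, error_type = "api_error", type(exc).__name__
    # Fail open: keep both facts and never block the write path.
    return {"action": "judge_fail_open", "reason": reason, "error_type": error_type}
```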
**Mitigations.**
- Soft kill switch. Set `MARTHA_MEMORY_JUDGE_MODEL=` (empty string) on the worker and restart. The activity becomes a no-op `judge_at` setter — pending rows still drain so the dispatcher SELECT doesn't loop, but no supersede attempts are made. No data is lost.
- Tighten threshold. Lower `MARTHA_MEMORY_JUDGE_THRESHOLD` (e.g. `0.3`) to drastically reduce candidate flow until the judge stabilizes. The threshold is a cosine-DISTANCE cutoff; smaller = stricter (only very-similar pairs become candidates).
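To build intuition for the threshold direction, here is a toy cosine-distance check over hypothetical 2-D vectors (real embeddings are high-dimensional, but the cutoff logic is the same):

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity; 0.0 = identical direction, 1.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

pairs = {
    "near-duplicate": ([1.0, 0.0], [0.96, 0.28]),   # distance ~0.04
    "weakly-related": ([1.0, 0.0], [0.5, 0.866]),   # distance ~0.50
    "unrelated":      ([1.0, 0.0], [0.0, 1.0]),     # distance  1.00
}
for name, (a, b) in pairs.items():
    d = cosine_distance(a, b)
    # Lowering the threshold from 0.7 to 0.3 drops the weakly-related pair.
    print(f"{name}: d={d:.2f} candidate@0.7={d <= 0.7} candidate@0.3={d <= 0.3}")
```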
### A2 — sustained lag p95 > 5 min

**Likely cause.** Worker outage, dispatcher signal loss, or one tenant flooding the global FIFO.
**Triage.**
- Surface the tenant-level backlog skew:

  ```
  docker exec martha-postgres-dev psql -U martha -c "SELECT tenant_id, COUNT(*) FROM memory_items WHERE source_kind='fact' AND embedding IS NOT NULL AND judge_at IS NULL AND superseded_by IS NULL AND deleted_at IS NULL GROUP BY tenant_id ORDER BY 2 DESC LIMIT 10;"
  ```

- If one tenant dominates: this is the documented multi-tenant fairness limitation (rabbit-hole #20). A per-tenant fair-queue dispatcher lands in C2-4 (#242). Mitigation: temporarily raise `MARTHA_MEMORY_JUDGE_BATCH_SIZE=50`.
- If all tenants are backlogged: the worker is down. Restart `martha-worker` and watch the lag drop.
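To size the batch mitigation, a rough drain-time estimate helps. A sketch assuming one batch per dispatcher tick and a hypothetical 10-second tick interval (the real tick cadence may differ):

```python
import math

def drain_minutes(backlog_rows: int, batch_size: int, tick_seconds: float) -> float:
    """Worst-case minutes to drain a judge backlog at one batch per tick."""
    ticks = math.ceil(backlog_rows / batch_size)
    return ticks * tick_seconds / 60.0

# 6,000 pending rows at the default batch of 20 and a 10s tick -> 50 min;
# raising MARTHA_MEMORY_JUDGE_BATCH_SIZE to 50 cuts that to 20 min.
print(drain_minutes(6000, 20, 10.0), drain_minutes(6000, 50, 10.0))
```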
### A3 — supersede rate goes to zero unexpectedly

**Likely cause.** Judge model rolled to one that's overly conservative, OR the soft kill switch is on.

**Triage.**
- Check `MARTHA_MEMORY_JUDGE_MODEL` on the worker. Empty string = soft kill switch.
- Check the `martha_memory_judge_total{action="judge_fail_open"}` rate — high indicates the judge is failing rather than just KEEPing.
- Check the `martha_memory_judge_total{action="candidates_top_k_match"}` rate — high means candidates are being found but the judge says all-KEEP. Likely a model regression; revert `MARTHA_MEMORY_JUDGE_MODEL` to a known-good value.
## Manual procedures

### Force re-judge of a single row
When you want the activity to revisit a row's contradiction decision (e.g., after fixing a judge regression):
```sql
UPDATE memory_items
SET judge_at = NULL
WHERE id = '<row_id>';
```

Safety: `judge_at = NOW()` is always a terminal state in normal operation. Re-judging cannot loop because the activity sets `judge_at` again on the next pass. The dispatcher's pending SELECT picks up the row on the next tick. (Verified — risk R-21 in the spec.)
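The no-loop invariant above can be illustrated with a toy model, where in-memory dicts stand in for the table and `judge_pass` is a hypothetical stand-in for the activity:

```python
from datetime import datetime, timezone

def judge_pass(rows: list[dict]) -> int:
    """One activity pass: judge every pending row, then stamp judge_at (terminal)."""
    pending = [r for r in rows if r["judge_at"] is None]
    for row in pending:
        # ... run the judge, possibly set superseded_by on an older row ...
        row["judge_at"] = datetime.now(timezone.utc)  # terminal in normal operation
    return len(pending)

rows = [{"id": "a", "judge_at": None}, {"id": "b", "judge_at": None}]
assert judge_pass(rows) == 2   # first pass judges both
assert judge_pass(rows) == 0   # second pass: nothing pending, no loop
rows[0]["judge_at"] = None     # manual re-judge: UPDATE ... SET judge_at = NULL
assert judge_pass(rows) == 1   # picked up exactly once
```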
If you want to re-judge ALL rows for a tenant (e.g., after a model upgrade):

```sql
UPDATE memory_items
SET judge_at = NULL
WHERE tenant_id = '<tenant>'
  AND source_kind = 'fact'
  AND superseded_by IS NULL
  AND deleted_at IS NULL;
```

### Walk a supersede chain manually
To follow a chain backwards from a live row:
```sql
WITH RECURSIVE chain AS (
  SELECT id, superseded_by, content, indexed_at, 0 AS depth
  FROM memory_items WHERE superseded_by = '<live_row_id>'
  UNION ALL
  SELECT m.id, m.superseded_by, m.content, m.indexed_at, c.depth + 1
  FROM memory_items m
  JOIN chain c ON m.superseded_by = c.id
)
SELECT * FROM chain ORDER BY depth;
```

(An admin UI with a chain-walker view lands in C2-4.)
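The same walk can be done in application code once the rows are loaded. A sketch over plain dicts (`walk_chain` is hypothetical, not an existing helper):

```python
def walk_chain(rows: list[dict], live_row_id: str) -> list[dict]:
    """Follow superseded_by links backwards from a live row, newest first."""
    by_target: dict[str, list[dict]] = {}
    for r in rows:
        if r["superseded_by"] is not None:
            by_target.setdefault(r["superseded_by"], []).append(r)
    chain, frontier, depth = [], [live_row_id], 0
    while frontier:
        nxt = []
        for target in frontier:
            for r in by_target.get(target, []):
                chain.append({**r, "depth": depth})
                nxt.append(r["id"])
        frontier, depth = nxt, depth + 1
    return chain

rows = [
    {"id": "v1", "superseded_by": "v2", "content": "lives in Lisbon"},
    {"id": "v2", "superseded_by": "v3", "content": "lives in Berlin"},
    {"id": "v3", "superseded_by": None, "content": "lives in Tokyo"},  # live row
]
assert [r["id"] for r in walk_chain(rows, "v3")] == ["v2", "v1"]
```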
## Rollback procedures
| Severity | Action |
|---|---|
| Soft (judge issues) | Set `MARTHA_MEMORY_JUDGE_MODEL=` (empty), restart worker. Pending rows still drain (no-op `judge_at` setter). No data loss; supersede links remain. |
| Soft (dispatcher issues) | De-register `judge_memory_items_activity` from the worker by reverting `temporal/worker.py`, restart. Pending rows pile up safely; re-register after fix. |
| Hard (column issues) | Run the alembic down-migration: `alembic downgrade -1`. Drops the three columns + two partial indexes. Pre-existing rows unaffected. Supersede links lost; rows remain. |
## Tuning levers
| Env var | Default | When to change |
|---|---|---|
| `MARTHA_MEMORY_JUDGE_MODEL` | `claude-haiku-4-5` | Swap to a faster/cheaper model when budget pressure rises. Empty string = soft kill. |
| `MARTHA_MEMORY_JUDGE_TOPK` | 5 | Raise to 10 if you observe contradiction misses (rare — our threshold is generous). Lower to 3 if cost dominates. |
| `MARTHA_MEMORY_JUDGE_THRESHOLD` | 0.7 | Cosine-DISTANCE cutoff. Lower (e.g. 0.5) to filter out more weakly-related pairs; raise (e.g. 0.85) to give the judge more candidates and accept higher LLM cost. Default 0.7 empirically catches typical contradiction pairs (vegan↔fish, Lisbon↔Berlin, etc.). |
| `MARTHA_MEMORY_JUDGE_TIMEOUT_S` | 5.0 | Raise if the judge model is slow (Sonnet/Opus). Lower to 3.0 if you need tighter tick latency. |
| `MARTHA_MEMORY_JUDGE_BATCH_SIZE` | 20 | Raise to 50 to drain backlog faster. Lower to 5 to reduce per-tick wall-clock when the judge model is slow. |
| `MARTHA_MEMORY_ANONYMIZE_SOFT_TTL_DAYS` | 30 | Soft-deleted memory rows older than this are eligible for anonymization. |
| `MARTHA_MEMORY_ANONYMIZE_SUP_TTL_DAYS` | 90 | Superseded memory rows older than this are eligible for anonymization. |
| `MARTHA_MEMORY_ANONYMIZE_BATCH_SIZE` | 100 | Max rows processed by one retention activity tick. |
| `MARTHA_MEMORY_ANONYMIZE_DRY_RUN` | true in prod rollout | Logs/counts eligible rows without writing. Use for the first observation window. |
## Retention anonymizer

The C2-3c retention worker anonymizes retired memory rows instead of deleting them. It replaces `content` with `[REDACTED]`, clears `user_id`, `agent_id`, and `embedding`, preserves the row `id` and `superseded_by` links, and inserts an immutable `memory.anonymize` audit event without content.
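The row transform just described can be sketched as follows (a minimal illustration; `anonymize_row` and the dict shape are hypothetical, not the real activity code):

```python
def anonymize_row(row: dict) -> dict:
    """Redact content and identity fields; keep id and supersede links intact."""
    return {
        **row,
        "content": "[REDACTED]",
        "user_id": None,
        "agent_id": None,
        "embedding": None,
        # id and superseded_by are deliberately untouched so chains stay walkable.
    }

row = {"id": "m1", "superseded_by": "m2", "content": "vegan",
       "user_id": "u1", "agent_id": "a1", "embedding": [0.1, 0.2]}
out = anonymize_row(row)
assert out["content"] == "[REDACTED]" and out["user_id"] is None
assert out["id"] == "m1" and out["superseded_by"] == "m2"
```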
Operational checks:
- Dry-run: set `MARTHA_MEMORY_ANONYMIZE_DRY_RUN=true`, restart `martha-worker`, wait for the daily `MemoryRetentionWorkflow`, then inspect `martha_memory_anonymize_total{action="dry_run"}`.
- Live flip: set dry-run false only after the dry-run count matches expectation. Watch `martha_memory_anonymize_total{action="anonymize_soft"}` and `martha_memory_anonymize_total{action="anonymize_supersede"}`.
- Safety floor: if `safety_floor_blocked` increments, inspect the tenant's due-row ratio before setting the override `MARTHA_MEMORY_ANONYMIZE_ALLOW_MASS=true`.
## What to NEVER do

- Never delete rows from `execution_audit_log`. The DB trigger blocks it (by design); supersede decisions must remain auditable.
- Never run `UPDATE memory_items SET superseded_by = ... WHERE ...` from a manual SQL session unless you're in incident response and have written authorization. The activity is the only legitimate writer; a hand-rolled UPDATE bypasses the security floor (tenant + identity WHERE clauses).
- Never truncate `memory_items` to "reset" — chats reference these rows. Use the C2-3c retention anonymizer for cleanup.