Agent memory runbook — supersede lifecycle (C2-3b)

Operator playbook for the contradiction-handling subsystem introduced in #240. Covers detection, intervention, and tuning. Pair with the user-facing guide at docs/guides/agent-memory.md.

What runs where

| Surface | Component | Lives in |
|---|---|---|
| Agent write path | remember_fact platform fn | core/memory_remember_tools.py |
| Async judge | judge_memory_items_activity | temporal/activities.py |
| Dispatcher | MemoryEmbedDispatcherWorkflow | temporal/memory_embed_workflow.py |
| Judge prompt + parser | run_judge | core/memory_judge.py |
| Candidate lookup | _find_supersede_candidates | core/memory_index/search.py |
| Metrics | counters + histograms | core/memory_metrics.py |
| Audit log writer | inline INSERT ExecutionAuditLog | temporal/activities.py |

Healthy steady state

| Signal | Mimir query | Target |
|---|---|---|
| Judge throughput | sum(rate(martha_memory_judge_total[5m])) | > 0 when fact writes occur |
| Supersede rate | sum by (action) (rate(martha_memory_judge_total{action="supersede"}[5m])) | proportional to contradiction frequency |
| Fail-open rate | sum(rate(martha_memory_judge_total{action="judge_fail_open"}[5m])) | ≤ 5% of total |
| Lag p95 | histogram_quantile(0.95, sum by (le) (rate(martha_memory_judge_lag_seconds_bucket[5m]))) | ≤ 30s on a single worker |
| Latency p95 | histogram_quantile(0.95, sum by (le) (rate(martha_memory_judge_latency_seconds_bucket[5m]))) | ≤ 2s for Haiku-class judge |

Common alerts

A1 — sustained fail-open rate > 5%

Likely cause. Judge model unreachable, rate-limited, or returning malformed output.

Triage.

  1. Pull recent worker logs: docker logs martha-worker --tail 200 | grep memory_judge.fail_open. The structured WARN line carries reason, model, and error_type, but never content.
  2. If reason=api_error and error_type=AuthenticationError → the API key has expired. Rotate it.
  3. If reason=timeout → transient model overload. Keep watching; the rate recovers when the upstream does.
  4. If reason=parse_error consistently → the judge is returning non-JSON. Check that the model id is the expected one (typo in the env var, accidental rollout to a different model).
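
The triage steps above can be sketched as a small log summarizer. This is a minimal illustration, not part of the shipped tooling: it assumes the WARN line carries key=value tokens (as in reason=timeout), and the sample lines, the TimeoutError value, and the function names summarize_fail_open and suggest are hypothetical.

```python
import re
from collections import Counter

# Assumption: fields appear as key=value tokens on the WARN line.
FIELD = re.compile(r"(\w+)=(\S+)")


def summarize_fail_open(lines):
    """Count memory_judge.fail_open events by their reason field."""
    counts = Counter()
    for line in lines:
        if "memory_judge.fail_open" not in line:
            continue  # ignore unrelated log lines
        fields = dict(FIELD.findall(line))
        counts[fields.get("reason", "unknown")] += 1
    return counts


def suggest(reason, error_type=None):
    """Map a reason/error_type pair to the triage step in this runbook."""
    if reason == "api_error" and error_type == "AuthenticationError":
        return "rotate the expired API key"
    if reason == "timeout":
        return "watch; recovers when the upstream does"
    if reason == "parse_error":
        return "check the configured judge model id"
    return "inspect the logs manually"


# Hypothetical sample lines, shaped after the fields named above.
sample = [
    "WARN memory_judge.fail_open reason=timeout model=claude-haiku-4-5 error_type=TimeoutError",
    "WARN memory_judge.fail_open reason=timeout model=claude-haiku-4-5 error_type=TimeoutError",
    "WARN memory_judge.fail_open reason=api_error model=claude-haiku-4-5 error_type=AuthenticationError",
    "INFO memory_judge.ok action=keep",
]
counts = summarize_fail_open(sample)
top_reason = counts.most_common(1)[0][0]
```

If the real log format differs (for example JSON lines), only the field extraction needs to change; the reason-to-action mapping stays as listed in the triage steps.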

Mitigations.

  • Soft kill switch. Set MARTHA_MEMORY_JUDGE_MODEL= (empty string) on the worker and restart. The activity then only sets judge_at without calling the judge: pending rows still drain, so the dispatcher SELECT doesn't loop, but no supersedes are attempted. No data is lost.
  • Tighten threshold. Lower MARTHA_MEMORY_JUDGE_THRESHOLD (e.g. to 0.3) to sharply reduce candidate flow until the judge stabilizes. The threshold is a cosine-DISTANCE cutoff, so smaller = stricter (only very similar pairs become candidates).
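
To make the threshold direction concrete, here is a minimal sketch of a cosine-distance cutoff. It assumes a candidate is kept when its distance (1 minus cosine similarity) is at or below the threshold; the helper names and the toy vectors are hypothetical and do not reproduce _find_supersede_candidates.

```python
import math


def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity (0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def filter_candidates(new_vec, existing, threshold, top_k=5):
    """Return up to top_k row ids whose distance to new_vec is <= threshold."""
    scored = [(cosine_distance(new_vec, vec), row_id) for row_id, vec in existing.items()]
    kept = sorted(pair for pair in scored if pair[0] <= threshold)
    return [row_id for _, row_id in kept[:top_k]]


# Toy 2-d embeddings: distances to new_vec are roughly 0.005, 0.55, and 1.0.
existing = {
    "near": [1.0, 0.1],
    "mid": [1.0, 2.0],
    "far": [0.0, 1.0],
}
new_vec = [1.0, 0.0]

loose = filter_candidates(new_vec, existing, threshold=0.7)   # default cutoff
strict = filter_candidates(new_vec, existing, threshold=0.3)  # tightened cutoff
```

With the default 0.7 both "near" and "mid" reach the judge; at 0.3 only "near" does, which is why lowering the value reduces judge load.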

A2 — sustained lag p95 > 5min

Likely cause. Worker outage, dispatcher signal-loss, or one tenant flooding the global FIFO.

Triage.

  1. docker exec martha-postgres-dev psql -U martha -c "SELECT tenant_id, COUNT(*) FROM memory_items WHERE source_kind='fact' AND embedding IS NOT NULL AND judge_at IS NULL AND superseded_by IS NULL AND deleted_at IS NULL GROUP BY tenant_id ORDER BY 2 DESC LIMIT 10;" — surfaces the tenant-level backlog skew.
  2. If one tenant dominates: this is the documented multi-tenant fairness limitation (rabbit-hole #20). The per-tenant fair-queue dispatcher lands in C2-4 (#242). Mitigation: temporarily raise MARTHA_MEMORY_JUDGE_BATCH_SIZE to 50.
  3. If all tenants backlogged: worker is down. Restart martha-worker and watch the lag drop.

A3 — supersede rate goes to zero unexpectedly

Likely cause. Judge model rolled to one that's overly conservative, OR the soft kill switch is on.

Triage.

  1. Check MARTHA_MEMORY_JUDGE_MODEL on the worker. Empty string = soft kill switch.
  2. Check the martha_memory_judge_total{action="judge_fail_open"} rate. A high rate means the judge is failing rather than just returning KEEP.
  3. Check the martha_memory_judge_total{action="candidates_top_k_match"} rate. A high rate means candidates are being found but the judge returns KEEP for all of them. This likely indicates a model regression; revert MARTHA_MEMORY_JUDGE_MODEL to a known-good value.

Manual procedures

Force re-judge of a single row

When you want the activity to revisit a row's contradiction decision (e.g., after fixing a judge regression):

```sql
UPDATE memory_items
SET judge_at = NULL
WHERE id = '<row_id>';
```

Safety: in normal operation a non-NULL judge_at is a terminal state. Resetting it to NULL makes the row match the dispatcher's pending SELECT on the next tick; the activity then sets judge_at again, so re-judging cannot loop. (Verified; risk R-21 in the spec.)

If you want to re-judge ALL rows for a tenant (e.g., after a model upgrade):

```sql
UPDATE memory_items
SET judge_at = NULL
WHERE tenant_id = '<tenant>'
  AND source_kind = 'fact'
  AND superseded_by IS NULL
  AND deleted_at IS NULL;
```
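
The reset-then-drain behavior can be rehearsed against a throwaway database before touching production. The sketch below uses an in-memory SQLite table with a simplified, assumed schema (only the columns named in this runbook); the pending predicate is copied from the A2 triage query, and the final UPDATE merely simulates what the judge activity would do.

```python
import sqlite3

# Pending predicate as used in the A2 backlog query.
PENDING = """
SELECT id FROM memory_items
WHERE source_kind = 'fact' AND embedding IS NOT NULL
  AND judge_at IS NULL AND superseded_by IS NULL AND deleted_at IS NULL
ORDER BY id
"""

conn = sqlite3.connect(":memory:")
# Assumed minimal schema; the real table has more columns.
conn.execute(
    """CREATE TABLE memory_items (
        id TEXT PRIMARY KEY, tenant_id TEXT, source_kind TEXT, embedding BLOB,
        judge_at TEXT, superseded_by TEXT, deleted_at TEXT)"""
)
rows = [
    ("a", "t1", b"\x00", "2024-01-01", None),  # live, already judged
    ("b", "t1", b"\x00", "2024-01-01", "a"),   # superseded by "a"
    ("c", "t2", b"\x00", "2024-01-01", None),  # other tenant
]
conn.executemany("INSERT INTO memory_items VALUES (?, ?, 'fact', ?, ?, ?, NULL)", rows)


def pending():
    return [r[0] for r in conn.execute(PENDING)]


before = pending()

# Tenant-wide reset, same WHERE clause as the statement above.
conn.execute(
    """UPDATE memory_items SET judge_at = NULL
       WHERE tenant_id = ? AND source_kind = 'fact'
         AND superseded_by IS NULL AND deleted_at IS NULL""",
    ("t1",),
)
after_reset = pending()

# Simulate the activity marking the row as judged again.
conn.execute("UPDATE memory_items SET judge_at = datetime('now') WHERE id = 'a'")
after_judge = pending()
```

Only the live row of tenant t1 re-enters the pending set, and it leaves again once judge_at is set, which matches the no-loop safety argument.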

Walk a supersede chain manually

To follow a chain backwards from a live row:

```sql
WITH RECURSIVE chain AS (
    SELECT id, superseded_by, content, indexed_at, 0 AS depth
    FROM memory_items WHERE superseded_by = '<live_row_id>'
    UNION ALL
    SELECT m.id, m.superseded_by, m.content, m.indexed_at, c.depth + 1
    FROM memory_items m
    JOIN chain c ON m.superseded_by = c.id
)
SELECT * FROM chain ORDER BY depth;
```

(Admin UI lands in C2-4 with a chain-walker view.)
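
To see what the recursive query returns, the following sketch runs the same CTE against an in-memory SQLite table. The schema is reduced to the four columns the query touches, and the three rows (v1 superseded by v2, v2 superseded by v3) are invented sample data.

```python
import sqlite3

# Same shape as the chain query above, with the live row id as a parameter.
CHAIN_SQL = """
WITH RECURSIVE chain AS (
    SELECT id, superseded_by, content, indexed_at, 0 AS depth
    FROM memory_items WHERE superseded_by = ?
    UNION ALL
    SELECT m.id, m.superseded_by, m.content, m.indexed_at, c.depth + 1
    FROM memory_items m
    JOIN chain c ON m.superseded_by = c.id
)
SELECT id, content, depth FROM chain ORDER BY depth
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE memory_items (id TEXT PRIMARY KEY, superseded_by TEXT, content TEXT, indexed_at TEXT)"
)
conn.executemany(
    "INSERT INTO memory_items VALUES (?, ?, ?, ?)",
    [
        ("v1", "v2", "Lives in Lisbon", "2024-01-01"),  # oldest fact
        ("v2", "v3", "Lives in Berlin", "2024-02-01"),  # replaced v1
        ("v3", None, "Lives in Porto", "2024-03-01"),   # live row
    ],
)

history = conn.execute(CHAIN_SQL, ("v3",)).fetchall()
```

Depth 0 is the row the live row directly replaced, and each further depth is one step older. The live row itself is not part of the result.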

Rollback procedures

| Severity | Action |
|---|---|
| Soft (judge issues) | Set MARTHA_MEMORY_JUDGE_MODEL= (empty string), restart worker. Pending rows still drain (no-op judge_at setter). No data loss; supersede links remain. |
| Soft (dispatcher issues) | De-register judge_memory_items_activity from the worker by reverting temporal/worker.py, restart. Pending rows pile up safely; re-register after fix. |
| Hard (column issues) | Run the Alembic down-migration: alembic downgrade -1. Drops the three columns + two partial indexes. Pre-existing rows are unaffected. Supersede links are lost; rows remain. |

Tuning levers

| Env var | Default | When to change |
|---|---|---|
| MARTHA_MEMORY_JUDGE_MODEL | claude-haiku-4-5 | Swap to a faster/cheaper model when budget pressure rises. Empty string = soft kill. |
| MARTHA_MEMORY_JUDGE_TOPK | 5 | Raise to 10 if you observe contradiction misses (rare; our threshold is generous). Lower to 3 if cost dominates. |
| MARTHA_MEMORY_JUDGE_THRESHOLD | 0.7 | Cosine-DISTANCE cutoff. Lower (e.g. 0.5) to filter out more weakly-related pairs; raise (e.g. 0.85) to give the judge more candidates and accept higher LLM cost. Default 0.7 empirically catches typical contradiction pairs (vegan↔fish, Lisbon↔Berlin, etc.). |
| MARTHA_MEMORY_JUDGE_TIMEOUT_S | 5.0 | Raise if the judge model is slow (Sonnet/Opus). Lower to 3.0 if you need tighter tick latency. |
| MARTHA_MEMORY_JUDGE_BATCH_SIZE | 20 | Raise to 50 to drain backlog faster. Lower to 5 to reduce per-tick wall-clock when judge model is slow. |
| MARTHA_MEMORY_ANONYMIZE_SOFT_TTL_DAYS | 30 | Soft-deleted memory rows older than this are eligible for anonymization. |
| MARTHA_MEMORY_ANONYMIZE_SUP_TTL_DAYS | 90 | Superseded memory rows older than this are eligible for anonymization. |
| MARTHA_MEMORY_ANONYMIZE_BATCH_SIZE | 100 | Max rows processed by one retention activity tick. |
| MARTHA_MEMORY_ANONYMIZE_DRY_RUN | true in prod rollout | Logs/counts eligible rows without writing. Use for the first observation window. |

Retention anonymizer

The C2-3c retention worker anonymizes retired memory rows instead of deleting them. It replaces content with [REDACTED], clears user_id, agent_id, and embedding, preserves the row id and superseded_by links, and inserts an immutable memory.anonymize audit event without content.
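
The field-level effect described above can be sketched as a pure function. This is an illustration under assumptions, not the worker's code: the MemoryRow dataclass, the anonymize function, and the audit event's dictionary shape are hypothetical, while the redaction rules follow the paragraph above.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class MemoryRow:
    """Hypothetical in-memory view of a memory_items row."""
    id: str
    content: str
    user_id: Optional[str]
    agent_id: Optional[str]
    embedding: Optional[list]
    superseded_by: Optional[str]


def anonymize(row):
    """Return the redacted row plus a content-free audit event."""
    redacted = replace(
        row,
        content="[REDACTED]",  # content is replaced, not deleted
        user_id=None,
        agent_id=None,
        embedding=None,
    )
    # The audit event names the row but never carries its content.
    audit_event = {"event": "memory.anonymize", "row_id": row.id}
    return redacted, audit_event


old = MemoryRow(
    id="v1",
    content="Lives in Lisbon",
    user_id="u-1",
    agent_id="a-1",
    embedding=[0.1, 0.2],
    superseded_by="v2",
)
redacted, audit_event = anonymize(old)
```

Because id and superseded_by are untouched, a chain walk over anonymized rows still resolves; only the payload and identity fields are gone.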

Operational checks:

  • Dry-run: set MARTHA_MEMORY_ANONYMIZE_DRY_RUN=true, restart martha-worker, wait for the daily MemoryRetentionWorkflow, then inspect martha_memory_anonymize_total{action="dry_run"}.
  • Live flip: set MARTHA_MEMORY_ANONYMIZE_DRY_RUN=false only after the dry-run count matches expectation. Watch martha_memory_anonymize_total{action="anonymize_soft"} and martha_memory_anonymize_total{action="anonymize_supersede"}.
  • Safety floor: if safety_floor_blocked increments, inspect the tenant's due-row ratio before setting the override MARTHA_MEMORY_ANONYMIZE_ALLOW_MASS=true.

What to NEVER do

  • Never delete rows from execution_audit_log. The DB trigger blocks it (by design); supersede decisions must remain auditable.
  • Never UPDATE memory_items SET superseded_by = ... WHERE ... from a manual SQL session unless you're in incident response and have written authorization. The activity is the only legitimate writer; a hand-rolled UPDATE bypasses the security floor (tenant + identity WHERE clauses).
  • Never truncate memory_items to "reset" — chats reference these rows. Use the C2-3c retention anonymizer for cleanup.

References

  • Spec: dev_docs/specs/agent-memory-superseded-by.md
  • Pre-deploy baseline: dev_docs/observability/agent-memory-c2-3b-baseline.md
  • Issue: #240 (parent: #231)
  • Predecessor runbook: agent-memory C2-2c remember_fact (no separate runbook — write path is synchronous and self-contained)

Martha is built by aiaiai-pt.