# Agent memory runbook — supersede lifecycle (C2-3b)

Operator playbook for the contradiction-handling subsystem introduced in #240. Covers detection, intervention, and tuning. Pair with the user-facing guide at docs/guides/agent-memory.md.
## What runs where
| Surface | Component | Lives in |
|---|---|---|
| Agent write path | `remember_fact` platform fn | `core/memory_remember_tools.py` |
| Async judge | `judge_memory_items_activity` | `temporal/activities.py` |
| Dispatcher | `MemoryEmbedDispatcherWorkflow` | `temporal/memory_embed_workflow.py` |
| Judge prompt + parser | `run_judge` | `core/memory_judge.py` |
| Candidate lookup | `_find_supersede_candidates` | `core/memory_index/search.py` |
| Metrics | counters + histograms | `core/memory_metrics.py` |
| Audit log writer | inline INSERT `ExecutionAuditLog` | `temporal/activities.py` |
## Healthy steady state
| Signal | Mimir query | Target |
|---|---|---|
| Judge throughput | `sum(rate(martha_memory_judge_total[5m]))` | > 0 when fact writes occur |
| Supersede rate | `sum by (action) (rate(martha_memory_judge_total{action="supersede"}[5m]))` | proportional to contradiction frequency |
| Fail-open rate | `sum(rate(martha_memory_judge_total{action="judge_fail_open"}[5m]))` | ≤ 5% of total |
| Lag p95 | `histogram_quantile(0.95, sum by (le) (rate(martha_memory_judge_lag_seconds_bucket[5m])))` | ≤ 30s on a single worker |
| Latency p95 | `histogram_quantile(0.95, sum by (le) (rate(martha_memory_judge_latency_seconds_bucket[5m])))` | ≤ 2s for Haiku-class judge |
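To spot-check the fail-open target without a dashboard, divide the two counter rates. A trivial sketch (`fail_open_ratio` is a hypothetical helper; the sample rates would come from the queries above):

```python
def fail_open_ratio(fail_open_rate: float, total_rate: float) -> float:
    """Fraction of judge outcomes that failed open (0.0 when the judge is idle)."""
    if total_rate == 0:
        return 0.0
    return fail_open_rate / total_rate

# Example: 0.02 fail-opens/s against 1.0 judged/s is 2%, under the 5% target.
print(fail_open_ratio(0.02, 1.0))
```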
## Common alerts

### A1 — sustained fail-open rate > 5%
**Likely cause.** Judge model unreachable, rate-limited, or returning malformed output.

**Triage.**
- Pull recent worker logs: `docker logs martha-worker --tail 200 | grep memory_judge.fail_open`. The structured WARN line carries `reason`, `model`, and `error_type` — never content.
- If `reason=api_error` and `error_type=AuthenticationError` → expired API key. Rotate it.
- If `reason=timeout` → transient model overload. Watch — recovers when upstream does.
- If `reason=parse_error` consistently → the judge is returning non-JSON. Check that the model id is the expected one (typo in env var, accidental rollout to a different model).
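For context, the `reason` values above map to distinct failure paths around the judge call. A minimal sketch of the fail-open pattern (the function names and the exact exception mapping are illustrative assumptions, not the real core/memory_judge.py code):

```python
import json

def judge_with_fail_open(call_judge, prompt: str) -> dict:
    """Run the judge; on any failure, fall back to KEEP and record why."""
    try:
        raw = call_judge(prompt)  # may raise on network/auth/timeout
        return {"action": json.loads(raw)["action"]}
    except TimeoutError:
        reason, error_type = "timeout", "TimeoutError"
    except (json.JSONDecodeError, KeyError) as exc:
        reason, error_type = "parse_error", type(exc).__name__
    except Exception as exc:  # auth errors, rate limits, etc.
        reason, error_type = "api_error", type(exc).__name__
    # Fail open: keep both facts and never block the write path.
    return {"action": "judge_fail_open", "reason": reason, "error_type": error_type}
```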
**Mitigations.**
- Soft kill switch. Set `MARTHA_MEMORY_JUDGE_MODEL=` (empty string) on the worker and restart. The activity becomes a no-op `judge_at` setter — pending rows still drain so the dispatcher SELECT doesn't loop, but no supersede attempts are made. No data is lost.
- Tighten threshold. Lower `MARTHA_MEMORY_JUDGE_THRESHOLD` (e.g. `0.3`) to drastically reduce candidate flow until the judge stabilizes. The threshold is a cosine-DISTANCE cutoff; smaller = stricter (only very-similar pairs become candidates).
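To build intuition for the threshold direction, here is a toy cosine-distance check over hypothetical 2-D vectors (real embeddings are high-dimensional, but the cutoff logic is the same):

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity; 0.0 = identical direction, 1.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

pairs = {
    "near-duplicate": ([1.0, 0.0], [0.96, 0.28]),   # distance ~0.04
    "weakly-related": ([1.0, 0.0], [0.5, 0.866]),   # distance ~0.50
    "unrelated":      ([1.0, 0.0], [0.0, 1.0]),     # distance  1.00
}
for name, (a, b) in pairs.items():
    d = cosine_distance(a, b)
    # Lowering the threshold from 0.7 to 0.3 drops the weakly-related pair.
    print(f"{name}: d={d:.2f} candidate@0.7={d <= 0.7} candidate@0.3={d <= 0.3}")
```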
### A2 — sustained lag p95 > 5 min

**Likely cause.** Worker outage, dispatcher signal loss, or one tenant flooding the global FIFO.
**Triage.**
- Surface the tenant-level backlog skew:

  ```
  docker exec martha-postgres-dev psql -U martha -c "SELECT tenant_id, COUNT(*) FROM memory_items WHERE source_kind='fact' AND embedding IS NOT NULL AND judge_at IS NULL AND superseded_by IS NULL AND deleted_at IS NULL GROUP BY tenant_id ORDER BY 2 DESC LIMIT 10;"
  ```

- If one tenant dominates: this is the documented multi-tenant fairness limitation (rabbit-hole #20). A per-tenant fair-queue dispatcher lands in C2-4 (#242). Mitigation: temporarily raise `MARTHA_MEMORY_JUDGE_BATCH_SIZE=50`.
- If all tenants are backlogged: the worker is down. Restart `martha-worker` and watch the lag drop.
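To size the batch mitigation, a rough drain-time estimate helps. A sketch assuming one batch per dispatcher tick and a hypothetical 10-second tick interval (the real tick cadence may differ):

```python
import math

def drain_minutes(backlog_rows: int, batch_size: int, tick_seconds: float) -> float:
    """Worst-case minutes to drain a judge backlog at one batch per tick."""
    ticks = math.ceil(backlog_rows / batch_size)
    return ticks * tick_seconds / 60.0

# 6,000 pending rows at the default batch of 20 and a 10s tick -> 50 min;
# raising MARTHA_MEMORY_JUDGE_BATCH_SIZE to 50 cuts that to 20 min.
print(drain_minutes(6000, 20, 10.0), drain_minutes(6000, 50, 10.0))
```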
### A3 — supersede rate goes to zero unexpectedly

**Likely cause.** Judge model rolled to one that's overly conservative, OR the soft kill switch is on.

**Triage.**
- Check `MARTHA_MEMORY_JUDGE_MODEL` on the worker. Empty string = soft kill switch.
- Check the `martha_memory_judge_total{action="judge_fail_open"}` rate — high indicates the judge is failing rather than just KEEPing.
- Check the `martha_memory_judge_total{action="candidates_top_k_match"}` rate — high means candidates are being found but the judge says all-KEEP. Likely a model regression; revert `MARTHA_MEMORY_JUDGE_MODEL` to a known-good value.
## Manual procedures

### Force re-judge of a single row
When you want the activity to revisit a row's contradiction decision (e.g., after fixing a judge regression):
```sql
UPDATE memory_items
SET judge_at = NULL
WHERE id = '<row_id>';
```

Safety: `judge_at = NOW()` is always a terminal state in normal operation. Re-judging cannot loop because the activity sets `judge_at` again on the next pass. The dispatcher's pending SELECT picks up the row on the next tick. (Verified — risk R-21 in the spec.)
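The no-loop invariant above can be illustrated with a toy model, where in-memory dicts stand in for the table and `judge_pass` is a hypothetical stand-in for the activity:

```python
from datetime import datetime, timezone

def judge_pass(rows: list[dict]) -> int:
    """One activity pass: judge every pending row, then stamp judge_at (terminal)."""
    pending = [r for r in rows if r["judge_at"] is None]
    for row in pending:
        # ... run the judge, possibly set superseded_by on an older row ...
        row["judge_at"] = datetime.now(timezone.utc)  # terminal in normal operation
    return len(pending)

rows = [{"id": "a", "judge_at": None}, {"id": "b", "judge_at": None}]
assert judge_pass(rows) == 2   # first pass judges both
assert judge_pass(rows) == 0   # second pass: nothing pending, no loop
rows[0]["judge_at"] = None     # manual re-judge: UPDATE ... SET judge_at = NULL
assert judge_pass(rows) == 1   # picked up exactly once
```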
If you want to re-judge ALL rows for a tenant (e.g., after a model upgrade):

```sql
UPDATE memory_items
SET judge_at = NULL
WHERE tenant_id = '<tenant>'
  AND source_kind = 'fact'
  AND superseded_by IS NULL
  AND deleted_at IS NULL;
```

### Walk a supersede chain manually
To follow a chain backwards from a live row:
```sql
WITH RECURSIVE chain AS (
  SELECT id, superseded_by, content, indexed_at, 0 AS depth
  FROM memory_items WHERE superseded_by = '<live_row_id>'
  UNION ALL
  SELECT m.id, m.superseded_by, m.content, m.indexed_at, c.depth + 1
  FROM memory_items m
  JOIN chain c ON m.superseded_by = c.id
)
SELECT * FROM chain ORDER BY depth;
```

(An admin UI with a chain-walker view lands in C2-4.)
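The same walk can be done in application code once the rows are loaded. A sketch over plain dicts (`walk_chain` is hypothetical, not an existing helper):

```python
def walk_chain(rows: list[dict], live_row_id: str) -> list[dict]:
    """Follow superseded_by links backwards from a live row, newest first."""
    by_target: dict[str, list[dict]] = {}
    for r in rows:
        if r["superseded_by"] is not None:
            by_target.setdefault(r["superseded_by"], []).append(r)
    chain, frontier, depth = [], [live_row_id], 0
    while frontier:
        nxt = []
        for target in frontier:
            for r in by_target.get(target, []):
                chain.append({**r, "depth": depth})
                nxt.append(r["id"])
        frontier, depth = nxt, depth + 1
    return chain

rows = [
    {"id": "v1", "superseded_by": "v2", "content": "lives in Lisbon"},
    {"id": "v2", "superseded_by": "v3", "content": "lives in Berlin"},
    {"id": "v3", "superseded_by": None, "content": "lives in Tokyo"},  # live row
]
assert [r["id"] for r in walk_chain(rows, "v3")] == ["v2", "v1"]
```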
## Rollback procedures
| Severity | Action |
|---|---|
| Soft (judge issues) | Set `MARTHA_MEMORY_JUDGE_MODEL=` (empty), restart worker. Pending rows still drain (no-op `judge_at` setter). No data loss; supersede links remain. |
| Soft (dispatcher issues) | De-register `judge_memory_items_activity` from the worker by reverting `temporal/worker.py`, restart. Pending rows pile up safely; re-register after fix. |
| Hard (column issues) | Run the alembic down-migration: `alembic downgrade -1`. Drops the three columns + two partial indexes. Pre-existing rows unaffected. Supersede links lost; rows remain. |
## Tuning levers
| Env var | Default | When to change |
|---|---|---|
| `MARTHA_MEMORY_JUDGE_MODEL` | `claude-haiku-4-5` | Swap to a faster/cheaper model when budget pressure rises. Empty string = soft kill. |
| `MARTHA_MEMORY_JUDGE_TOPK` | 5 | Raise to 10 if you observe contradiction misses (rare — our threshold is generous). Lower to 3 if cost dominates. |
| `MARTHA_MEMORY_JUDGE_THRESHOLD` | 0.7 | Cosine-DISTANCE cutoff. Lower (e.g. 0.5) to filter out more weakly-related pairs; raise (e.g. 0.85) to give the judge more candidates and accept higher LLM cost. Default 0.7 empirically catches typical contradiction pairs (vegan↔fish, Lisbon↔Berlin, etc.). |
| `MARTHA_MEMORY_JUDGE_TIMEOUT_S` | 5.0 | Raise if the judge model is slow (Sonnet/Opus). Lower to 3.0 if you need tighter tick latency. |
| `MARTHA_MEMORY_JUDGE_BATCH_SIZE` | 20 | Raise to 50 to drain backlog faster. Lower to 5 to reduce per-tick wall-clock when the judge model is slow. |
| `MARTHA_MEMORY_ANONYMIZE_SOFT_TTL_DAYS` | 30 | Soft-deleted memory rows older than this are eligible for anonymization. |
| `MARTHA_MEMORY_ANONYMIZE_SUP_TTL_DAYS` | 90 | Superseded memory rows older than this are eligible for anonymization. |
| `MARTHA_MEMORY_ANONYMIZE_BATCH_SIZE` | 100 | Max rows processed by one retention activity tick. |
| `MARTHA_MEMORY_ANONYMIZE_DRY_RUN` | true in prod rollout | Logs/counts eligible rows without writing. Use for the first observation window. |
## Retention anonymizer

The C2-3c retention worker anonymizes retired memory rows instead of deleting them. It replaces `content` with `[REDACTED]`, clears `user_id`, `agent_id`, and `embedding`, preserves the row `id` and `superseded_by` links, and inserts an immutable `memory.anonymize` audit event without content.
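The row transform just described can be sketched as follows (a minimal illustration; `anonymize_row` and the dict shape are hypothetical, not the real activity code):

```python
def anonymize_row(row: dict) -> dict:
    """Redact content and identity fields; keep id and supersede links intact."""
    return {
        **row,
        "content": "[REDACTED]",
        "user_id": None,
        "agent_id": None,
        "embedding": None,
        # id and superseded_by are deliberately untouched so chains stay walkable.
    }

row = {"id": "m1", "superseded_by": "m2", "content": "vegan",
       "user_id": "u1", "agent_id": "a1", "embedding": [0.1, 0.2]}
out = anonymize_row(row)
assert out["content"] == "[REDACTED]" and out["user_id"] is None
assert out["id"] == "m1" and out["superseded_by"] == "m2"
```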
Operational checks:
- Dry-run: set `MARTHA_MEMORY_ANONYMIZE_DRY_RUN=true`, restart `martha-worker`, wait for the daily `MemoryRetentionWorkflow`, then inspect `martha_memory_anonymize_total{action="dry_run"}`.
- Live flip: set dry-run false only after the dry-run count matches expectation. Watch `martha_memory_anonymize_total{action="anonymize_soft"}` and `martha_memory_anonymize_total{action="anonymize_supersede"}`.
- Safety floor: if `safety_floor_blocked` increments, inspect the tenant's due-row ratio before setting the override `MARTHA_MEMORY_ANONYMIZE_ALLOW_MASS=true`.
## What to NEVER do

- Never delete rows from `execution_audit_log`. The DB trigger blocks it (by design); supersede decisions must remain auditable.
- Never run `UPDATE memory_items SET superseded_by = ... WHERE ...` from a manual SQL session unless you're in incident response and have written authorization. The activity is the only legitimate writer; a hand-rolled UPDATE bypasses the security floor (tenant + identity WHERE clauses).
- Never truncate `memory_items` to "reset" — chats reference these rows. Use the C2-3c retention anonymizer for cleanup.