
Agent memory

Martha's platform memory layer (issue #231 C1) lets agents recall past chat messages and tool outputs from the same session via an explicit recall tool. This guide covers the user-facing surface; for design context see dev_docs/specs/agent-memory-platform-recall.md.

What it does

Long agent conversations grow beyond what fits in the model's context window. Even with prompt caching (which reduces cost), older turns drop out of the active context entirely. The memory layer keeps a hybrid-searchable record of every chat message and every tool result, so the agent can ask "what was the result of that Metabase query 15 turns ago?" via the recall tool — without you having to scroll back or re-run the query.

How agents use it

The recall tool is registered as a platform function. Every agent that has access to platform functions gets it automatically. The agent calls it like any other tool:

```json
{
  "name": "recall",
  "arguments": {
    "query": "ANCHOR_TOKEN_7a3f9 metabase order count",
    "top_k": 5,
    "source_kinds": ["tool_output"]
  }
}
```

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `query` | string | (required) | Natural-language search query; matched by both keyword and semantic similarity. |
| `top_k` | integer | 5 | Number of items to return (max 20). |
| `source_kinds` | list of strings | (all) | Filter by source kind. Supported: `chat_message`, `tool_output`, `document_chunk`. Omit to search all kinds. |
| `scope` | string | `"any"` | One of `"session"` (this chat only), `"tenant"` (uploaded documents only), `"user"` (explicit cross-session facts for the same user; see below), `"agent"` (explicit cross-session facts written by this agent), or `"any"` (default: unions all four via weighted Reciprocal Rank Fusion). |
| `enable_rerank` | boolean | `true` | Apply cross-encoder rerank for higher precision. Disable only if you specifically want raw hybrid scores. |

Scope details (C2-2b)

C2-2b widens recall from session+tenant to four scope classes, plus the combined "any" default:

  • session — chat messages and tool outputs from the current chat session. Bounded by the request's session_id.
  • tenant — uploaded document chunks tagged tenant-wide. Bounded by tenant only; no per-session restriction. Indexed via the C2-2a doc-chunk write-through.
  • user — explicit user-scope facts (cross-session) within this tenant. Requires the request to carry an authenticated human user (Keycloak sub claim). Service-account chats can't use this scope and get a 401. Note: chat messages are NOT visible via scope='user' even if their user_id column is populated — that's a separate feature (cross-session-by-user recall, tracked at issue #237). scope='user' matches only rows that were written explicitly with scope='user' — today, that's the remember_fact platform function (C2-2c).
  • agent — explicit agent-scope facts written by THIS agent across all sessions in the tenant. Requires the request to have an agent_id (an agent must be attached to the chat). Returns 401 otherwise.
  • any (default) — UNIONs all four sources via weighted Reciprocal Rank Fusion. Static class priors bias session items slightly higher than tenant items in the fail-open path (when cross-encoder rerank is unavailable). The pattern is documented in dev_docs/solutions/cross-source-rank-fusion.md.
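The weighted RRF merge can be sketched in a few lines. This is a minimal illustration of the technique, not the implementation documented in dev_docs/solutions/cross-source-rank-fusion.md; the function name, parameter shapes, and the RRF constant `k=60` are assumptions.

```python
def weighted_rrf(ranked_lists, weights, k=60):
    """Merge per-source rankings with weighted Reciprocal Rank Fusion.

    ranked_lists: {source_class: [item_id, ...]} in rank order.
    weights: static class priors, e.g. {"session": 1.3, "tenant": 1.0}.
    """
    scores = {}
    for source, items in ranked_lists.items():
        prior = weights.get(source, 1.0)
        for rank, item_id in enumerate(items, start=1):
            # Classic RRF contribution 1/(k + rank), scaled by the class prior.
            scores[item_id] = scores.get(item_id, 0.0) + prior / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

With the default priors, a session item and a tenant item at the same per-source rank tie-break toward the session item, which is the fail-open bias described above.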

Tenant isolation is unconditional in every scope: a recall request scoped to tenant T can never return rows from tenant T'.

Tuning class priors

The static class priors used for weighted RRF are env-overridable for per-tenant ops tuning:

```shell
MARTHA_RECALL_WEIGHT_SESSION=1.3   # default
MARTHA_RECALL_WEIGHT_USER=1.1      # default
MARTHA_RECALL_WEIGHT_AGENT=1.0     # default
MARTHA_RECALL_WEIGHT_TENANT=1.0    # default
```

Negative or non-numeric values fall back to the default. Setting a weight to 0 effectively suppresses that source class. Adaptive (query-aware) weighting is deferred to issue #238 — it requires per-tenant judgment data that isn't available in C2-2b.
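The fallback parsing can be sketched as follows. The env var names mirror the ones above; the helper itself and its defaults dictionary are hypothetical, not the platform's code.

```python
import os

DEFAULT_WEIGHTS = {"session": 1.3, "user": 1.1, "agent": 1.0, "tenant": 1.0}

def load_recall_weights(env=None):
    """Read class priors from the environment, falling back per the rules above."""
    env = os.environ if env is None else env
    weights = {}
    for cls, default in DEFAULT_WEIGHTS.items():
        raw = env.get(f"MARTHA_RECALL_WEIGHT_{cls.upper()}")
        try:
            value = float(raw)
        except (TypeError, ValueError):
            value = default                      # unset or non-numeric: default
        weights[cls] = value if value >= 0 else default  # negative: default
    return weights
```

Note that `0` passes the `>= 0` check, so a zero weight survives and suppresses that source class, as described above.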

Response

```json
{
  "items": [
    {
      "id": "uuid",
      "source_kind": "tool_output",
      "source_ref": "tool_call_id_xyz",
      "content": "...",
      "event_time": "2026-05-02T14:23:01.234567+00:00",
      "score": 0.872631
    }
  ],
  "total": 1,
  "degraded": false,
  "rerank_used": true
}
```

`degraded: true` indicates that some rows are still being embedded in the background: semantic recall may be incomplete, while keyword recall remains complete. Recall stays usable in this state.

`rerank_used: false` indicates the cross-encoder rerank was skipped (either disabled or fell through on a TEI failure). Items are still returned, just with merge-order scores: no functional regression, only slightly less precise ranking.

Tying back to elided tool outputs

When a tool returns a large payload, Slice A (#208) elides it behind a tool_output_key (e.g. tout_abc...). The memory layer stores the same elided preview, so recall returns the preview. To expand the full payload, the agent calls read_tool_output(tool_output_key="tout_abc...") from Slice A — same key, full content.
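The agent-side flow can be sketched as below. This is a hypothetical helper, assuming the elided preview embeds the `tout_` key somewhere in its text; the regex, the helper name, and the preview format are assumptions, not the Slice A contract.

```python
import re

# Hypothetical pattern for tool_output_key tokens embedded in an elided preview.
TOUT_KEY = re.compile(r"\btout_[A-Za-z0-9]+\b")

def expand_elided(item, read_tool_output):
    """If a recalled preview carries a tool_output_key, fetch the full payload."""
    match = TOUT_KEY.search(item.get("content", ""))
    if match:
        # Same key Slice A issued at elision time; returns the full content.
        return read_tool_output(tool_output_key=match.group(0))
    return item["content"]
```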

What's stored

| Source | Key | Notes |
| --- | --- | --- |
| `chat_message` | `ChatMessage.id` | One row per user message plus one per assistant response. Chat history in `chat_messages` is the source of truth; `memory_items` carries a recallable copy capped at 16 KiB. |
| `tool_output` | `tool_call_id` | One row per tool call. Stores the elided preview if the result was oversized; the full payload lives in `tool_outputs` (Slice A). |
| `document_chunk` (C2-2a) | `document_chunks.id` | One row per chunk, written when ingestion finalizes a revision. Stale revisions are soft-deleted by `finalize_ingestion_activity` (latest revision only). The embedding is reused from the ingestion pipeline, so there is no double-embed cost. |
| `fact` (C2-2c) | `fact:{scope}:{identity}:{sha256(content)[:16]}` | One row per `remember_fact` call. Always cross-session (`session_id` is NULL). `scope='user'` rows carry `user_id`; `scope='agent'` rows carry `agent_id`. Idempotent: the same (identity, content) re-asserts to the same row via the deterministic `source_ref` plus the existing UNIQUE constraint. |

C2 follow-ups will add session_summary: sparing summaries triggered at 80% of the model's context window.
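The deterministic fact key from the table above can be reproduced in a few lines. This is a sketch of the key scheme only; the function name is invented.

```python
import hashlib

def fact_source_ref(scope: str, identity: str, content: str) -> str:
    """Build the idempotency key fact:{scope}:{identity}:{sha256(content)[:16]}."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()[:16]
    return f"fact:{scope}:{identity}:{digest}"
```

Because the ref depends only on (scope, identity, content), re-asserting the same fact maps to the same row, which is what makes `remember_fact` idempotent.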

Rerank (C2-3a)

Recall over-fetches up to N=20 candidates and passes them to a cross-encoder reranker for final ordering. Default model: Alibaba-NLP/gte-multilingual-reranker-base (306M parameters, encoder-only mGTE, 70+ languages, Apache 2.0). Served by a martha-rerank Docker container running Hugging Face TEI.

Fail-open contract. Any rerank failure (TEI down, timeout, malformed response) returns merge-order results plus a structured WARN log. Recall always succeeds; rerank is a quality upgrade, never a correctness gate.
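The fail-open wrapper amounts to a try/except around the rerank call. A minimal sketch, with invented names; the real code presumably also distinguishes timeout from malformed-response errors.

```python
import logging

logger = logging.getLogger("martha.recall")

def rerank_fail_open(candidates, rerank_fn):
    """Return reranked candidates, or the merge-order list on any failure."""
    try:
        return rerank_fn(candidates)
    except Exception as exc:
        # Structured WARN log; recall itself still succeeds.
        logger.warning("rerank failed, falling back to merge order: %r", exc)
        return candidates
```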

Configuration. Set MARTHA_RERANKER_URL=http://martha-rerank:80 in .env to enable rerank for the deployment. Unset the variable to disable rerank entirely (recall falls back to merge-order). Swap models by overriding the container's --model-id argument; alternatives include BAAI/bge-reranker-v2-m3 (higher quality multilingual, ~2x slower) and Alibaba-NLP/gte-reranker-modernbert-base (English-only, ~2x faster).

Latency. Steady-state recall p50 ≈ 300-400ms with rerank, p95 ≈ 500-600ms. The first recall after a deploy adds ≈ 200-300ms of TEI JIT warmup before latencies settle. The container's first-ever start downloads ~600MB from HF Hub; the deploy script pre-warms it to keep recall fast on the first post-deploy call.

Tenant isolation

Every recall query is scoped to the calling agent's tenant. Cross-tenant access is blocked at the SQL layer (every WHERE clause filters by tenant_id). The agent cannot recall content from other tenants even if it injects a tenant_id parameter — the value is sourced exclusively from the session context.
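A sketch of how the filter can be built so `tenant_id` is never attacker-controlled. The helper, the parameter style, and the context dict are assumptions, not the actual query builder.

```python
def build_recall_filter(session_ctx: dict, scope: str):
    """Build a WHERE fragment; tenant_id comes only from the session context,
    never from the tool arguments, so the agent cannot widen the filter."""
    clauses = ["tenant_id = %(tenant_id)s"]
    params = {"tenant_id": session_ctx["tenant_id"]}
    if scope == "session":
        clauses.append("session_id = %(session_id)s")
        params["session_id"] = session_ctx["session_id"]
    return " AND ".join(clauses), params
```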

Embedding cadence

New rows are embedded asynchronously by a long-running Temporal dispatcher workflow (MemoryEmbedDispatcherWorkflow):

  • Best case (~500ms): remember() posts a Temporal signal that wakes the dispatcher within milliseconds. Most async write paths (agent loop tool elision) hit this case.
  • Worst case (~5s): signal is dropped or the writer is sync (chat persist). The dispatcher's fallback tick catches the row.

Until embedded, rows are recallable via the BM25 (keyword) half of hybrid search. The degraded flag in the response indicates an incomplete vector half.
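The signal-plus-fallback-tick pattern can be sketched with plain asyncio (the real dispatcher is a Temporal workflow; the event objects and function names here are stand-ins for Temporal signals and timers):

```python
import asyncio

async def dispatcher_loop(wake, stop, embed_pending, tick_seconds=5.0):
    """Embed pending rows when signaled, or at the latest every tick."""
    while not stop.is_set():
        try:
            await asyncio.wait_for(wake.wait(), timeout=tick_seconds)
            wake.clear()   # best case: a remember() signal woke us within ms
        except asyncio.TimeoutError:
            pass           # worst case: the fallback tick catches the row
        await embed_pending()
```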

Compressed mode (C2-1)

Issue #231 C2-1 adds a budget-gated trim to build_active_context. When the previous turn's usage.prompt_tokens exceeds 80% of the model's context window, the composer switches from a 20-message tail to a 4-message tail. The agent then reaches older content via the existing recall tool and read_tool_output.

Behavior is invisible to end users — long sessions just keep working instead of blowing the context window. From the agent's perspective:

  • Below 80% of model window: full recent-turns history is in-prompt (today's behavior, unchanged).
  • Above 80%: only the last 4 messages stay in-prompt. To reference earlier content, the agent calls recall(query="...") and reads what the tool returns.

The recall tool description tells the agent explicitly: when the conversation has been trimmed for length, call recall before answering — don't guess. In practice agents reliably reach for the tool in this regime.
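The trim decision reduces to a single comparison. A sketch under the stated 80% threshold and 20/4-message tails; the function name is invented.

```python
def tail_length(prompt_tokens, model_window, threshold=0.8,
                full_tail=20, compressed_tail=4):
    """Choose the message-tail length for the next prompt build.

    Switches to the short tail once the previous turn's prompt_tokens
    exceeded 80% of the model's context window.
    """
    return compressed_tail if prompt_tokens > threshold * model_window else full_tail
```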

Cache impact

There is exactly one prefix-bust event per compression epoch (the turn the threshold trips on). Subsequent turns share a stable prefix until the next threshold trip. For typical session lengths this is far less cache churn than auto-injection of retrieved memory on every turn (the alternative we considered and rejected).

Operating note

If you need to drive a threshold cross from a short conversation in development or testing, the env var MEMORY_BUDGET_MODEL_WINDOW_OVERRIDE=8000 (or any value) overrides the model-window lookup. The API and worker emit a startup warning when this is set; production must NOT set it.

Contradiction handling (C2-3b)

When the agent calls remember_fact with content that contradicts an earlier user-scope or agent-scope fact, the platform retires the older fact in the background (within ~30s on a healthy worker). The retired fact is invisible to default recall — the agent never sees the older version once supersession lands.

The agent surface is exactly what C2-2c shipped: remember_fact(content, scope, metadata?) returns {id, was_new}. There is no agent-facing supersede tool, no superseded field on the envelope, no include_superseded parameter on recall — the platform makes the supersession decision.

Internally, an asynchronous Temporal activity (judge_memory_items_activity) drains pending fact rows after they're embedded. For each pending row, the activity:

  1. Looks up vector-similar live candidates within the same (tenant, scope, identity) bucket — never across tenants, never across users, never across agents.
  2. Calls a small/fast judge LLM (default claude-haiku-4-5) to decide for each candidate: SUPERSEDE or KEEP.
  3. Applies SUPERSEDE atomically — the older row's superseded_by points to the new row, plus an immutable execution_audit_log entry records the decision (IDs only, never content).
  4. Sets judge_at = NOW() on every processed row, including fail-open paths, so the dispatcher pending SELECT cannot loop.

Fail-open contract. Any judge failure (timeout, parse error, model unavailable) leaves all rows in their pre-judge state — judge_at is still set so processing isn't retried, no supersede applies. The system never silently retires data on judge errors.

Admin / audit access. The Python recall(..., include_superseded=True) kwarg (in core.memory_index.search) returns retired rows alongside live ones. This is reachable only via Python code — admin UIs, audit jobs, dashboards — and is intentionally absent from the agent-facing JSON Schema.

Tuning. Four env vars control the judge:

  • MARTHA_MEMORY_JUDGE_MODEL=claude-haiku-4-5 — empty string disables the judge (no-op judge_at setter; rollback signal).
  • MARTHA_MEMORY_JUDGE_TOPK=5 — candidates passed to the judge per pending row.
  • MARTHA_MEMORY_JUDGE_THRESHOLD=0.7 — cosine-distance cutoff (max(0.0, min(1.0, value))); rows with cosine_distance < threshold are candidates. Distance, not similarity: contradiction pairs sit at distance ~0.65 with current embedding models — a similarity threshold would starve the judge.
  • MARTHA_MEMORY_JUDGE_BATCH_SIZE=20 — pending rows processed per dispatcher tick.
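The threshold clamp and the candidate test can be sketched directly from the rules above (helper names are invented; the clamp mirrors `max(0.0, min(1.0, value))`):

```python
def judge_threshold(raw, default=0.7):
    """Parse MARTHA_MEMORY_JUDGE_THRESHOLD, clamping into [0.0, 1.0]."""
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return default
    return max(0.0, min(1.0, value))

def is_candidate(cosine_distance, threshold):
    """Distance cutoff, not similarity: smaller distance means more similar."""
    return cosine_distance < threshold
```

With the default 0.7 cutoff, a contradiction pair at distance ~0.65 is a candidate; a similarity-style cutoff at the same number would have excluded it.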

See agent-memory-runbook.md for the operator runbook (fail-open detection, judge_at clear procedure, batch-size tuning) and dev_docs/observability/agent-memory-c2-3b-baseline.md for the pre-deploy validation set.

Retention anonymization (C2-3c)

Retired memory rows are anonymized, not hard-deleted. The retention worker replaces content with [REDACTED], clears identity and embedding columns, suffixes source_ref with |anonymized, sets anonymized_at, and writes a memory.anonymize audit-log event with IDs/reasons only. Rows stay in memory_items so superseded_by chains remain intact.
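The transform applied to each retired row can be sketched as a pure function (column names follow the description above; the helper itself is hypothetical, and the real worker also writes the `memory.anonymize` audit event):

```python
from datetime import datetime, timezone

def anonymize_row(row: dict) -> dict:
    """Redact content, clear identity and embedding, suffix source_ref,
    stamp anonymized_at. The row survives so superseded_by chains stay intact."""
    out = dict(row)
    out["content"] = "[REDACTED]"
    out["user_id"] = None
    out["agent_id"] = None
    out["embedding"] = None
    out["source_ref"] = row["source_ref"] + "|anonymized"
    out["anonymized_at"] = datetime.now(timezone.utc).isoformat()
    return out
```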

Default recall always excludes anonymized rows via _apply_scope_filter. This is stricter than include_superseded=True: admin/audit callers may include superseded rows, but anonymized rows never re-enter recall.

Per-tenant policy lives in tenant_config.settings.memory_retention; missing or invalid config falls back to defaults. The worker runs in dry-run mode when MARTHA_MEMORY_ANONYMIZE_DRY_RUN=true and refuses large tenant sweeps via the safety floor.

Tuning.

  • MARTHA_MEMORY_ANONYMIZE_SOFT_TTL_DAYS=30
  • MARTHA_MEMORY_ANONYMIZE_SUP_TTL_DAYS=90
  • MARTHA_MEMORY_ANONYMIZE_BATCH_SIZE=100
  • MARTHA_MEMORY_ANONYMIZE_DRY_RUN=false in dev; production should start with true for the first observation window.

Out of scope (deferred to later C2 stages)

  • Admin UI + observability dashboard for cross-source recall and remember_fact corpus health (C2-4, #242).
  • Per-tenant rerank model selection.
  • LLM-based summarization at threshold (replaced by agent-driven recall in C2-1; revisit only if measurement shows narrative-thread loss).

See dev_docs/specs/agent-memory-platform-recall.md for the full C2 scope outline, dev_docs/specs/agent-memory-budget-trim.md for the C2-1 design, and dev_docs/specs/agent-memory-doc-chunks-rerank.md for the C2-2a + C2-3a design. Cleanup follow-up is tracked in #233.

See also

  • Document Tools — for searching uploaded document corpora (different store, different lifecycle).
  • Citations — for formal source references from agents to document chunks.

Martha is built by aiaiai-pt.