Codename: Cookbook Memory · Research & engineering project · 4 people · 2 weeks

Cookbook Memory with Four Recipes for Long-Running Agents

Four recipes for giving a long-running coding agent durable memory — what to remember, when to write it, where to store it, and how to retrieve it fast and keep it consistent. The bet: with memory, Haiku closes the gap to Opus 4.8.


Core modules: write · store & retrieve · dream · measure

Storage backends: Markdown · SQLite-vector · Graph

In-scope benchmarks: SWE-Bench-CL + VISTA (4 legacy kept available)

Metrics: per-benchmark native (CL suite · poisoning · RSI safety)

The hypothesis

Memory is leverage. We measure how much.

Frontier models win by holding more in their head. Give a smaller model a memory it can save, retrieve, and reconcile — and it borrows that edge.

🎯

Success criterion

Haiku + harness beats Opus 4.8 with no memory on at least two benchmarks. We report on the two in-scope benchmarks, each on its native metrics: SWE-Bench-CL (public) and VISTA (our own benchmark for long-term intent alignment and safety against the OWASP Agentic AI Top 10).

How it works

Four modules, three backends, one loop

The agent writes as it works, a router stores & retrieves per query, an offline worker dreams the store clean, and the eval harness measures the whole thing.

Module 1 · Write

💾 Write

Decides what, when, and where to save: durable, transferable lessons (invariants, conventions, fixes), each tagged with timestamp, relevancy, and session.
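To make the tagging concrete, here is a minimal sketch of what one saved lesson could look like. The field names, the three `KINDS`, and the `should_save` threshold are all assumptions for illustration, not the project's actual schema or write policy.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical record shape: names and values are illustrative only.
KINDS = {"invariant", "convention", "fix"}


@dataclass(frozen=True)
class MemoryRecord:
    kind: str          # one of KINDS
    text: str          # the lesson itself
    session: str       # id of the session that produced it
    relevancy: float   # writer's 0..1 estimate of how reusable the lesson is
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    def __post_init__(self):
        # Reject records that do not fit the assumed schema.
        if self.kind not in KINDS:
            raise ValueError(f"unknown kind: {self.kind!r}")
        if not 0.0 <= self.relevancy <= 1.0:
            raise ValueError("relevancy must be in [0, 1]")


def should_save(record: MemoryRecord, min_relevancy: float = 0.5) -> bool:
    """Assumed write policy: keep only lessons likely to transfer."""
    return record.relevancy >= min_relevancy
```

A record such as `MemoryRecord("fix", "Pin tz to UTC in tests", "s-1", 0.8)` would pass this filter, while a one-off note scored at 0.2 would not.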

Module 2 · Store & retrieve

🧭 Store & retrieve

A router classifies each query and sends it to the single best-suited backend (Markdown · SQLite-vector · Graph), then ranks the results by recency × relevancy, de-dups them, and returns a tight context.
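The route-then-rank flow can be sketched roughly as follows. The keyword rules, the exponential half-life used for recency, and the whitespace/case normalisation used for de-duplication are stand-ins chosen for this example; the real router and scoring are not described here.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    text: str
    relevancy: float   # 0..1 similarity to the query
    age_days: float    # time since the memory was written


def route(query: str) -> str:
    """Toy classifier: keyword rules stand in for the real router."""
    q = query.lower()
    if any(w in q for w in ("depends on", "calls", "imports", "related to")):
        return "graph"           # relationship questions
    if any(w in q for w in ("convention", "style", "how do we")):
        return "markdown"        # human-readable notes
    return "sqlite-vector"       # default: semantic search


def score(hit: Hit, half_life_days: float = 30.0) -> float:
    """recency x relevancy, with recency halving every half_life_days (assumed)."""
    recency = 0.5 ** (hit.age_days / half_life_days)
    return recency * hit.relevancy


def rank(hits: list[Hit], k: int = 3) -> list[Hit]:
    """De-dup on normalised text, keep the best copy, return the top k."""
    best: dict[str, Hit] = {}
    for h in hits:
        key = " ".join(h.text.lower().split())
        if key not in best or score(h) > score(best[key]):
            best[key] = h
    return sorted(best.values(), key=score, reverse=True)[:k]
```

Multiplying the two factors means a stale memory needs much higher relevancy to outrank a fresh one, which is one plausible reading of "recency × relevancy".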

Module 3 · Dream (async)

🌙 Dream

A plugin Stop hook fires the Daydream pass at session end, where an LLM decides what to remember. The deeper night Dream pass then de-dups, resolves contradictions, and sets governance across the whole store.
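As a rough illustration of the consolidation step, the sketch below de-dups a store and resolves contradictions with a simple "newest write per topic wins" rule. That rule, the `topic` key, and the `written_at` counter are assumptions made for this example; in the design described above an LLM makes these decisions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Memory:
    topic: str        # hypothetical key that identifies what the claim is about
    claim: str
    written_at: int   # increasing write order; larger means newer


def dream(store: list[Memory]) -> tuple[list[Memory], list[Memory]]:
    """Return (kept, dropped) after one consolidation pass.

    Exact duplicates and contradictory claims on the same topic are both
    handled the same way here: the newest write survives.
    """
    newest: dict[str, Memory] = {}
    dropped: list[Memory] = []
    for m in sorted(store, key=lambda m: m.written_at):
        prev = newest.get(m.topic)
        if prev is not None:
            dropped.append(prev)   # superseded by a newer memory
        newest[m.topic] = m
    return list(newest.values()), dropped
```

Returning the dropped items alongside the kept ones leaves room for an audit trail, which a governance step would likely want.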

Module 4 · Measure

📊 Measure

The eval harness runs the agent on SWE-Bench-CL and VISTA with memory on vs. off, and a gated improvement loop only lands a change once it clears retrieval, consolidation, and solve-rate tiers.


What we measure

The four memory metrics

⏱️

Recency

Is the freshest relevant memory ranked first?

⚡

Efficiency

Tokens per retrieval; target: under roughly 10% overhead.

🎯

Relevancy

Retrieved items actually relate to the query.

✅

Accuracy

Task correctness, memory on vs. off.
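The four metrics above could be computed per query along these lines. The exact definitions are assumptions for this sketch: recency is scored as a hit-at-1 check, relevancy as precision over the returned list, efficiency as memory tokens relative to the prompt tokens without memory, and accuracy as the difference in solve rate.

```python
def recency_at_1(ranked: list[str], relevant: set[str],
                 written_at: dict[str, int]) -> float:
    """1.0 if the top-ranked item is the freshest relevant memory, else 0.0."""
    hits = [i for i in ranked if i in relevant]
    if not hits:
        return 0.0
    freshest = max(hits, key=lambda i: written_at[i])
    return 1.0 if ranked[0] == freshest else 0.0


def token_overhead(memory_tokens: int, prompt_tokens: int) -> float:
    """Retrieved-memory tokens as a fraction of the memory-free prompt."""
    return memory_tokens / prompt_tokens


def relevancy_precision(ranked: list[str], relevant: set[str]) -> float:
    """Share of returned items that actually relate to the query."""
    if not ranked:
        return 0.0
    return sum(i in relevant for i in ranked) / len(ranked)


def accuracy_delta(solved_on: list[bool], solved_off: list[bool]) -> float:
    """Solve rate with memory minus solve rate without it."""
    return sum(solved_on) / len(solved_on) - sum(solved_off) / len(solved_off)
```

Under these definitions, 400 memory tokens added to a 5,000-token prompt is an 8% overhead, inside the roughly 10% target.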

How we test it

Two benchmarks: one public, one our own

One asks whether memory makes the agent solve more; the other, whether it stays aligned and safe over a long run.

Public

SWE-Bench-CL

Continual learning over real GitHub fixes — knowledge transfer and resistance to catastrophic forgetting. Django 72% → 78% with dreamed memory.

Our own benchmark

VISTA

Long-term intent alignment & agent safety vs. the OWASP Agentic AI Top 10 (memory poisoning, tool misuse, goal manipulation). Cookbook vs. native memory, 97 journeys.

📊

VISTA scoreboard — cookbook vs. native memory (97 journeys, Haiku, equal cost)

The right memory is surfaced 3.6× more often (gold-retrieval F1 0.54 vs 0.15) · keeps up with policy changes 3.2× better (adaptation 0.63 vs 0.20) · 100% poisoning resistance (vs 80%) with 0% attack success (vs 20%).
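The headline multipliers follow directly from the paired scores. The short check below reproduces them; the `multiplier` helper is hypothetical, and it uses decimal arithmetic with half-up rounding because 0.63 / 0.20 is exactly 3.15, which only rounds to 3.2 under that rule.

```python
from decimal import Decimal, ROUND_HALF_UP


def multiplier(ours: str, baseline: str) -> Decimal:
    """Ratio of two scores, rounded half-up to one decimal place."""
    ratio = Decimal(ours) / Decimal(baseline)
    return ratio.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


print(multiplier("0.54", "0.15"))  # gold-retrieval F1 -> 3.6
print(multiplier("0.63", "0.20"))  # adaptation -> 3.2
```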


Four legacy benchmarks (MemoryAgentBench, LongMemEval, SWE-ContextBench, ContextBench) remain wired and available.