Four recipes for giving a long-running coding agent durable memory — what to remember, when to write it, where to store it, and how to retrieve it fast and keep it consistent. The bet: with memory, Haiku closes the gap to Opus 4.8.
Frontier models win by holding more in their heads. Give a smaller model a memory it can save, retrieve, and reconcile, and it borrows that edge.
Success criterion
Haiku + harness beats Opus 4.8 with no memory on ≥2 benchmarks. Results are reported on the two in-scope benchmarks, SWE-Bench-CL (public) and VISTA (our own benchmark for long-term intent alignment & safety vs. the OWASP Agentic AI Top 10), each on its native metrics.
The agent writes as it works, a router stores & retrieves per query, an offline worker dreams the store clean, and the eval harness measures the whole thing.
The agent decides what, when, and where to save: durable, transferable lessons (invariants, conventions, fixes), tagged with timestamp, relevancy, and session.
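A minimal sketch of what such a tagged record and save decision could look like. The names MemoryRecord, should_save, and remember, the 0.0-1.0 relevancy scale, and the 0.5 threshold are all assumptions for illustration; the source names only the three tags and the three example lesson kinds.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Lesson kinds taken from the examples in the text; the set itself is an assumption.
DURABLE_KINDS = {"invariant", "convention", "fix"}


@dataclass(frozen=True)
class MemoryRecord:
    text: str
    kind: str
    relevancy: float  # assumed to be on a 0.0-1.0 scale
    session: str
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def should_save(kind: str, relevancy: float, threshold: float = 0.5) -> bool:
    """Keep only durable lesson kinds that clear a (hypothetical) relevancy threshold."""
    return kind in DURABLE_KINDS and relevancy >= threshold


def remember(store: list, text: str, kind: str, relevancy: float, session: str) -> bool:
    """Append a tagged record to the store if it qualifies; report whether it was saved."""
    if not should_save(kind, relevancy):
        return False
    store.append(MemoryRecord(text, kind, relevancy, session))
    return True
```

In this sketch a transient note such as "opened foo.py" is rejected because its kind is not durable, while a convention with high relevancy is stored with its session and a UTC timestamp.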
A router classifies each query and hits the one best backend — Markdown · SQLite-vector · Graph — then ranks by recency × relevancy, de-dups, and returns a tight context.
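The routing and ranking steps above can be sketched as follows. This is an illustration under stated assumptions, not the project's router: the keyword classifier, the exponential half-life used for recency, and the dict-shaped items are all hypothetical, and each backend is stood in for by a plain callable.

```python
from datetime import datetime


def classify(query: str) -> str:
    """Toy keyword classifier (assumption); a real router would likely use a model."""
    q = query.lower()
    if any(w in q for w in ("depends on", "calls", "related to")):
        return "graph"
    if any(w in q for w in ("convention", "style", "policy")):
        return "markdown"
    return "vector"


def score(item: dict, now: datetime, half_life_days: float = 7.0) -> float:
    """recency x relevancy, with recency modelled as exponential decay (assumed form)."""
    age_days = (now - item["timestamp"]).total_seconds() / 86400
    recency = 0.5 ** (age_days / half_life_days)
    return recency * item["relevancy"]


def retrieve(query: str, backends: dict, now: datetime, k: int = 3) -> list:
    """Query exactly one backend, rank, de-dup on normalized text, return the top k."""
    backend = backends[classify(query)]
    ranked = sorted(backend(query), key=lambda it: score(it, now), reverse=True)
    seen, out = set(), []
    for it in ranked:
        key = " ".join(it["text"].lower().split())
        if key not in seen:
            seen.add(key)
            out.append(it)
    return out[:k]
```

Capping the result at k items is one simple way to keep the returned context tight; a token budget would be a natural alternative.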
A plugin Stop hook fires the Daydream pass at session end, where an LLM decides what to remember; the deeper night Dream de-dups, resolves contradictions, and sets governance across the whole store.
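To make the Dream pass concrete, here is a small deterministic stand-in. The source says an LLM makes these decisions; this sketch swaps in a simple "newest wins" rule instead, and the `topic` field, the dict-shaped records, and the function name dream are assumptions. Governance is not modelled.

```python
def dream(records: list) -> tuple:
    """Consolidate a store: drop duplicates, then keep the newest claim per topic."""
    # 1. De-dup: identical normalized text collapses to the newest copy.
    by_text = {}
    for r in records:
        key = " ".join(r["text"].lower().split())
        if key not in by_text or r["timestamp"] > by_text[key]["timestamp"]:
            by_text[key] = r

    # 2. Contradictions: records that share a topic but differ in text are
    #    assumed to conflict; the newest stays active, the rest are superseded.
    newest = {}
    for r in by_text.values():
        t = r["topic"]
        if t not in newest or r["timestamp"] > newest[t]["timestamp"]:
            newest[t] = r

    active = sorted(newest.values(), key=lambda r: r["topic"])
    superseded = [r for r in by_text.values() if newest[r["topic"]] is not r]
    return active, superseded
```

Returning the superseded records rather than deleting them keeps the pass auditable, which may matter when a contradiction was resolved the wrong way.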
The eval harness runs the agent on SWE-Bench-CL and VISTA with memory on vs. off, and a gated improvement loop only lands a change once it clears retrieval, consolidation, and solve-rate tiers.
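The gating idea can be sketched as a short check that walks the three tiers in order and refuses a change at the first tier it fails. The function names, the score dictionaries, and the "no regression" rule (min_gain of 0.0) are assumptions; the source does not state the actual thresholds.

```python
# Tier order as listed in the text.
TIERS = ("retrieval", "consolidation", "solve_rate")


def first_failed_tier(baseline: dict, candidate: dict, min_gain: float = 0.0):
    """Return the first tier the candidate fails to clear, or None if it clears all."""
    for tier in TIERS:
        if candidate[tier] - baseline[tier] < min_gain:
            return tier
    return None


def land_if_cleared(baseline: dict, candidate: dict) -> tuple:
    """Adopt the candidate only if every tier is cleared; otherwise keep the baseline."""
    failed = first_failed_tier(baseline, candidate)
    if failed is None:
        return candidate, None
    return baseline, failed
```

Checking tiers in order means an expensive solve-rate run is only needed for changes that already hold up on the cheaper retrieval and consolidation checks.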
Is the freshest relevant memory ranked first?
Tokens per retrieval — target < ~10% overhead.
Retrieved items actually relate to the query.
Task correctness, memory on vs. off.
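One way the four checks above could be computed is sketched below. The function names and item shape are hypothetical, and token overhead is assumed here to mean retrieved-memory tokens divided by the rest of the prompt; the source does not define the denominator.

```python
def freshest_first(ranked: list, is_relevant) -> bool:
    """True if the top result is the newest among the relevant retrieved items."""
    relevant = [it for it in ranked if is_relevant(it)]
    if not relevant:
        return False
    return ranked[0] is max(relevant, key=lambda it: it["timestamp"])


def token_overhead(memory_tokens: int, prompt_tokens: int) -> float:
    """Retrieved-memory tokens as a fraction of the remaining prompt (assumed definition)."""
    return memory_tokens / prompt_tokens


def precision(ranked: list, is_relevant) -> float:
    """Share of retrieved items that actually relate to the query."""
    if not ranked:
        return 0.0
    return sum(1 for it in ranked if is_relevant(it)) / len(ranked)


def solve_rate_delta(solved_on: list, solved_off: list) -> float:
    """Solve rate with memory minus solve rate without, over the same task list."""
    return sum(solved_on) / len(solved_on) - sum(solved_off) / len(solved_off)
```

Under this definition, 400 memory tokens on a 5,000-token prompt is an overhead of 0.08, inside the ~10% target.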
One asks whether memory makes the agent solve more; the other, whether it stays aligned and safe over a long run.
Continual learning over real GitHub fixes — knowledge transfer and resistance to catastrophic forgetting. Django 72% → 78% with dreamed memory.
Long-term intent alignment & agent safety vs. the OWASP Agentic AI Top 10 (memory poisoning, tool misuse, goal manipulation). Cookbook vs. native memory, 97 journeys.
VISTA scoreboard — cookbook vs. native memory (97 journeys, Haiku, equal cost)
Right memory surfaced 3.6× more often (gold-retrieval F1 0.54 vs 0.15) · keeps up with policy changes 3.2× better (adaptation 0.63 vs 0.20) · 100% poisoning resistance (vs 80%) with 0% attack success (vs 20%).
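For readers checking the multipliers, the arithmetic behind the scoreboard is worked out below, using only the figures quoted above. The last line assumes that poisoning resistance is the complement of attack success; the reported pairs are consistent with that reading, but the source does not state it.

```latex
\begin{align*}
\text{gold-retrieval F1:} \quad & \frac{0.54}{0.15} = 3.6 \\
\text{adaptation:} \quad & \frac{0.63}{0.20} = 3.15 \approx 3.2 \\
\text{resistance (assumed } 1 - \text{attack success):} \quad & 1 - 0 = 1.00, \qquad 1 - 0.20 = 0.80
\end{align*}
```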
Four legacy benchmarks (MemoryAgentBench, LongMemEval, SWE-ContextBench, ContextBench) remain wired and available.