Research & engineering · 4 people · 2 weeks

Cookbook Memory

With Four Recipes for Long-Running Agents

Who built it

Four people, one owner per module

A small team built this in two weeks — write, store, dream, and measure, each with a clear owner.

Ken Huang

Eval infrastructure & the site

@kenhuangus

Keith Mazanec

Harness architecture

@kmazanec

Brent Gibson

Storage & retrieval

@bgibson1618

Scott Bushyhead

Async “dreaming”

@NerdAlert58

The original idea

A cookbook for memory: what, when, where, how

Memory is leverage — but only if you keep the right things, at the right time, in the right place. That decision is the product.

A recipe for memory

WHAT

Durable, transferable lessons — Invariant · Convention · Fix. Not task-specific facts.

WHEN

While working (recall), at turn end (Daydream), at night (Dream).

WHERE

The right backend: Markdown · SQLite-vector · Graph. Router picks per query.

HOW

An LLM keeps what matters, embeds it, dedups, resolves contradictions, sets governance.


What it consolidates

Dream — nightly consolidation over the whole store

Once a night, the Dream worker walks the entire store. It prunes what's stale, merges what's redundant, resolves what disagrees, tags what's load-bearing, and generalizes what recurs. After the deduction passes complete, the run checks that the mutation sets are pairwise disjoint and aborts if any two overlap.

TTL prune

expire past each type's retention horizon

➜

Dedup

lexical + LLM paraphrase judge

➜

Contradiction

LLM picks pairs; worker deletes the loser deterministically

➜

Governance

must_know · must_do · blacklist · none — blacklist is a delete

➜

Induction

Cluster of Fix/Bug/Workaround cards → one Invariant or Convention. CREATE-only. Default off.

All prompts sha256-pinned. Pairwise-disjointness of mutation sets — pruned ⊥ retired ⊥ contradicted-losers ⊥ blacklisted ⊥ all-winners — is enforced; the run aborts on overlap.
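
The disjointness rule above can be sketched as a pairwise check over named sets of memory ids. This is a minimal illustration, not the project's code: the function names, the set names, and the use of string ids are all assumptions.

```python
from itertools import combinations


def find_overlaps(mutation_sets: dict[str, set[str]]) -> list[tuple[str, str, set[str]]]:
    """Return every pair of named sets that share at least one memory id."""
    overlaps = []
    for (name_a, ids_a), (name_b, ids_b) in combinations(mutation_sets.items(), 2):
        shared = ids_a & ids_b
        if shared:
            overlaps.append((name_a, name_b, shared))
    return overlaps


def assert_disjoint(mutation_sets: dict[str, set[str]]) -> None:
    """Abort (raise) if any memory id appears in more than one mutation set."""
    overlaps = find_overlaps(mutation_sets)
    if overlaps:
        raise RuntimeError(f"Dream run aborted: overlapping mutation sets {overlaps}")


# Hypothetical ids: each memory is touched by at most one pass, so this passes.
assert_disjoint({
    "pruned": {"m1"},
    "retired": {"m2"},
    "contradicted_losers": {"m3"},
    "blacklisted": {"m4"},
    "winners": {"m5"},
})
```

Checking every pair keeps the failure message specific: it names which two passes both claimed the same memory, which is usually what you need to debug an aborted run.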

What it captures

Daydream — selective extraction at session end

A Stop hook runs the Daydream pass on this session's transcript. It chunks, redacts, wraps each chunk in a nonce envelope, and asks a small subconscious model the only question that matters: would a future, different task save time by knowing this?

Stop hook

session ends, transcript ready

➜

Chunk & redact

secrets stripped (ADR-005)

➜

Nonce envelope

<transcript nonce="…"> — anti-injection

➜

LLM curates

V6 prompt: lessons that transfer

➜

10 OKF types

Fix · Bug · Convention · Invariant · Workaround · Strategy · Mistake · Decision · Preference · Identity

HIGH selectivity by design. Every kept memory is self-gating, phrased as "When <trigger>, <do X / avoid Y>", so recall fires it only on relevant future tasks.
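
The redact-then-wrap steps above might look roughly like the sketch below. The redaction pattern is a hypothetical stand-in (the ADR-005 rules are not shown here), and the closing-marker format is an assumption; only the opening `<transcript nonce="…">` tag comes from the pipeline description.

```python
import re
import secrets

# Hypothetical pattern: the real redaction rules (ADR-005) are not reproduced here.
SECRET_PATTERN = re.compile(r"(?i)\b(api[_-]?key|token|password)\s*[:=]\s*\S+")


def redact(text: str) -> str:
    """Replace the value of anything that looks like a credential assignment."""
    return SECRET_PATTERN.sub(lambda m: f"{m.group(1)}=[REDACTED]", text)


def wrap_in_envelope(chunk: str) -> tuple[str, str]:
    """Wrap a redacted chunk in markers carrying a fresh random nonce.

    The transcript cannot know the nonce in advance, so text inside the chunk
    cannot forge a marker that closes the envelope early.
    """
    nonce = secrets.token_hex(16)
    while nonce in chunk:  # vanishingly unlikely, but cheap to rule out
        nonce = secrets.token_hex(16)
    # Assumed closing marker; it is a prompt delimiter, not valid XML.
    envelope = (
        f'<transcript nonce="{nonce}">\n'
        f"{redact(chunk)}\n"
        f'</transcript nonce="{nonce}">'
    )
    return nonce, envelope


nonce, envelope = wrap_in_envelope("user: export API_KEY=abc123 then run tests")
print(envelope)
```

The curation prompt would then tell the model to treat everything between the two markers with that nonce as data, never as instructions.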

How memory works

Storage & retrieval

These are the droids you're looking for

Daydream

When picking a gift for Mom, skip anything strongly scented — perfumes give her headaches.

Router

OKF Markdown

Human-readable canonical source of truth — best for exact facts, decisions, conventions, and auditability.

SQLite-vector

Semantic similarity over lesson content — best for paraphrases, rationale, and “this reminds me of…” recall.

Graph

Typed links between memories — best for dependencies, impact, contradictions, and related concepts.

Agent

“Mom” (exact keyword)

“what present would she actually like?” (similarity)

“what do I know about Mom?” (relational)

OKF = Google’s new Open Knowledge Format: every memory is a plain, human-readable Markdown file. These memories can be imported from or exported to any other system using the OKF specification.
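
How the router decides is not described here, so the following is only a toy sketch of per-query routing across the three backends. The `Backend` enum, the cue phrases, and the two-word threshold are all invented for illustration; a real router could just as well use a classifier or an LLM.

```python
from enum import Enum


class Backend(Enum):
    MARKDOWN = "okf-markdown"   # exact facts, decisions, conventions
    VECTOR = "sqlite-vector"    # paraphrases and "this reminds me of" recall
    GRAPH = "graph"             # dependencies, contradictions, related concepts


# Hypothetical cue phrases that suggest a relational question.
RELATIONAL_CUES = ("what do i know about", "related to", "depends on", "linked to")


def route(query: str) -> Backend:
    """Pick one backend for a query using simple, assumed heuristics."""
    text = query.strip().lower()
    if any(cue in text for cue in RELATIONAL_CUES):
        return Backend.GRAPH
    # A bare keyword or two, with no question mark, is treated as exact lookup.
    if len(text.split()) <= 2 and not text.endswith("?"):
        return Backend.MARKDOWN
    return Backend.VECTOR


for q in ("Mom", "what present would she actually like?", "what do I know about Mom?"):
    print(f"{q!r} -> {route(q).value}")
```

With these rules the three example queries above land on Markdown, SQLite-vector, and Graph respectively.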

The Memory Console

How we test it

Two benchmarks: one public, one our own

One measures whether memory makes the agent solve more tasks; the other, whether it stays aligned and safe over the long run.

SWE-Bench-CL · public

Continual learning over a sequence of real GitHub fixes. Measures knowledge transfer across tasks and resistance to catastrophic forgetting — scored on its own native suite metrics.

VISTA · our own

Our purpose-built benchmark for long-term intent alignment & agent safety, evaluated against the OWASP Agentic AI Top 10 risks — memory poisoning, tool misuse, privilege compromise, intent-breaking & goal manipulation. 390 journeys across six domains: project management, code review, research synthesis, finance, legal & support.

The numbers

Two benchmarks, side by side

Does memory solve more tasks (SWE-Bench-CL), and is our memory better & safer than Claude's built-in (VISTA, 97 journeys)?

SWE-Bench-CL — solve rate ↑
naive memory vs cookbook · same agent & commit · Sympy near ceiling

Repository: naive memory · cookbook

Django: 22% · 26%

Sympy: 90%

VISTA benchmark — native metrics ↑
97 journeys · naive memory vs cookbook · higher is better

Metric: naive memory · cookbook

Gold-retrieval F1: 15 · 54

Adaptation rate: 20 · 63

Poison resist.: 80 · 100

RSI safety: 80 · 100
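
For readers unfamiliar with the first metric: a retrieval F1 is usually the harmonic mean of precision and recall over the retrieved items versus the gold items. VISTA's exact definition is not given here, so the sketch below assumes the standard set-based form, and the memory ids are made up.

```python
def retrieval_f1(retrieved: set[str], gold: set[str]) -> float:
    """Set-based F1: harmonic mean of precision and recall (assumed definition)."""
    hits = len(retrieved & gold)
    if hits == 0:
        return 0.0
    precision = hits / len(retrieved)  # share of retrieved memories that were gold
    recall = hits / len(gold)          # share of gold memories that were retrieved
    return 2 * precision * recall / (precision + recall)


# Hypothetical journey: 3 memories retrieved, 4 gold, 2 in common.
score = retrieval_f1({"m1", "m2", "m3"}, {"m2", "m3", "m4", "m5"})
print(round(score, 3))
```

Here precision is 2/3 and recall is 1/2, so the score is 4/7.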

How we use the benchmarks

The 4-tier gated improvement loop

The benchmarks aren't just a final score: they gate every change. A full run costs hours and money, so each change earns its way up through cheaper tiers first. Fail a tier and the change is rejected, with no spend on the next.

Tier 1

Golden dataset

➜

Tier 2

Categorized & labelled data

➜

Tier 3

Smoke test on a small dataset

➜

Tier 4

Full 9-hour run

↻ Memory Loop Engineering

↩ Fail any tier → reject the change before paying for the next, costlier one.
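
The gating logic above amounts to running the tiers in cost order and stopping at the first failure. A minimal sketch follows; the `Tier` type, the check callables, and the change id are placeholders, not the project's harness.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Tier:
    name: str
    check: Callable[[str], bool]  # returns True when the change passes this tier


def run_gates(change_id: str, tiers: list[Tier]) -> tuple[bool, list[str]]:
    """Run tiers cheapest-first; stop at the first failure.

    Returns (accepted, names of the tiers that actually ran).
    """
    ran = []
    for tier in tiers:
        ran.append(tier.name)
        if not tier.check(change_id):
            return False, ran  # rejected: later, costlier tiers never run
    return True, ran


# Placeholder checks standing in for the real evaluations.
tiers = [
    Tier("golden dataset", lambda change: True),
    Tier("categorized & labelled data", lambda change: False),
    Tier("smoke test", lambda change: True),
    Tier("full run", lambda change: True),
]
print(run_gates("change-123", tiers))
```

In this example the change fails Tier 2, so the smoke test and the full run are never paid for.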

Where we're headed

Frontier research & engineering on persistent memory for long-running agents.