Architecture

As built, end to end: run_bench drives the Claude Code CLI once per journey in one of three memory modes; a recall step reads memory (native files or the cookbook-memory MCP); the output is scored by a native evaluator (VISTA or SWE-Bench-CL). On session end a Stop hook fires the daydream writer, which extracts memories with an OpenRouter model and persists them through a router over three storage backends.

[Figure: system architecture diagram. Labels recovered from the diagram:]

  • Eval driver (memeval.claudecode): run_bench drives the Claude Code CLI (subscription auth) in mode off, builtin, or plugin-real. ClaudeCodeAgent runs one journey / task per run (QA turn, or agentic CODE checkout); --plugin-workers; serialized bundle build.
  • Native evaluator: scores recorded trajectories (VISTA, SWE-Bench-CL), plus a host grader for CODE tests.
  • Recall / retrieve step: builtin → native Grep/Read over sessions/; plugin-real → cookbook-memory MCP `recall`.
  • Write / consolidation (Stop hook → daydream): the Stop hook is async and fail-open and fires on session end. daydream-cli runs `daydream` per session and `dream --all` for consolidation; transcript_formatter is the noise filter; an OpenRouter LLM (ling-2.6-flash by default) extracts what to remember (_extract → MemoryItems).
  • RouterStore (one MemoryStore facade): classify → route, dedup + write-routing; default profile is fusion (RRF).
  • Storage backends (per-store index):
      MarkdownStore: OKF notes, source-of-truth base; index is an inverted keyword map; markdown/, always written.
      SqliteVectorStore: embed → cosine similarity; memory.db; brute-force by default; opt-in MiniLM + sqlite-vec ANN.
      GraphStore: OKF links, seed-then-traverse; graph.db; typed traversal; paid seam: Neo4j (uri=).
  • Profiles & opt-in backends: the profile is auto-selected by build_store: fusion (RRF) is the offline default, accuracy when VOYAGE_API_KEY is set, accuracy-local via MEMEVAL_LOCAL_ANN=1, speed only when forced. Track C (local MiniLM + sqlite-vec ANN) is opt-in inside the vector store (accuracy-local). Track D (Fts5Store lexical) is implemented but not yet wired into build_store's backend set. The router can also run a graph→vector cascade and a cross-backend reranker; the write path always persists the markdown base.
  • Legend: eval drive (run_bench → agent); recall / read & score; async write (Stop hook → daydream); router → backends. Arrow labels: drive, recall, notes, output.

Top band: the eval loop — run_bench drives one Claude Code journey per task, a recall step reads memory, and a native evaluator scores the recorded trajectory. Bottom bands: the asynchronous write path — the Stop hook fires the daydream writer, which extracts memories via OpenRouter and persists them through the router over the markdown / vector / graph backends.

Three flows

How data moves

① Recall (read)

Before answering, the agent recalls. In builtin mode that is Claude Code's own Grep/Read over the laid-down sessions/ files; in plugin-real mode it is the cookbook-memory MCP recall tool, which routes the query to its backend(s).

② Write (daydream)

When the session ends, a Stop hook fires daydream-cli. It reads the new transcript delta, filters noise, asks an OpenRouter model what to remember, and writes each extracted MemoryItem through the router. dream --all runs the nightly consolidation.

③ Score

The run's recorded trajectories are scored by the benchmark's native evaluator: VISTA (poisoning resistance, targeted ASR, gold-retrieval F1, adaptation, RSI safety) and SWE-Bench-CL (forgetting / BWT / FWT / AULC). CODE tasks are graded by running the repo's tests.

The dispatch layer

One router, selectable profiles

build_store assembles the three backends behind a single RouterStore and picks a routing profile — the plugin never sees a backend or an embedder. The profile decides whether a query single-routes, fans out, or runs a cascade.

Classify & route

A classifier scores each query and picks a backend:

  • Relationship / dependency / "what breaks" → Graph traversal.
  • Conceptual / "why" / summarize → SQLite vector search.
  • Literal keyword / signature / identifier → Markdown inverted index.

The default rule-based classifier is deterministic and stdlib-only; the accuracy profiles swap in a semantic, exemplar-based classifier over a real embedder.
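The routing rules above can be sketched with nothing but the standard library. This is an illustration only: the cue lists, the identifier regex, the `classify` name, and the fallback to the markdown index are all assumptions, since the real classifier's rules are not shown here.

```python
import re

# Hypothetical cue lists, chosen only to mirror the three bullets above.
GRAPH_CUES = ("depends on", "dependency", "what breaks", "related to")
VECTOR_CUES = ("why", "summarize", "summary", "explain", "overview")
# Identifier-looking tokens: snake_case, camelCase, dotted paths, call syntax.
IDENTIFIER = re.compile(r"[A-Za-z]+_[A-Za-z_]+|[a-z]+[A-Z]\w*|\w+\.\w+|\w+\(")

BACKENDS = ("graph", "vector", "markdown")  # fixed order = deterministic tie-break


def classify(query: str) -> str:
    """Score a query against each backend's cues and return the winner."""
    lowered = query.lower()
    scores = {
        "graph": sum(cue in lowered for cue in GRAPH_CUES),
        "vector": sum(cue in lowered for cue in VECTOR_CUES),
        "markdown": len(IDENTIFIER.findall(query)),
    }
    best = max(BACKENDS, key=lambda name: scores[name])
    # Assumed fallback: with no signal at all, use the literal keyword index.
    return best if scores[best] > 0 else "markdown"
```

Because the scores are plain counts and ties resolve by a fixed order, the same query always routes to the same backend, which is the property the rule-based default is described as having.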

Profiles

  • fusion — offline default: fan out and merge with Reciprocal Rank Fusion (best zero-dependency recall).
  • accuracy — when VOYAGE_API_KEY is set: semantic classifier + Voyage embeddings + graph→vector cascade.
  • accuracy-local, when MEMEVAL_LOCAL_ANN=1: MiniLM + sqlite-vec ANN.
  • speed — the bare v1 single-route router; only when explicitly forced.
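A minimal sketch of how a profile could be chosen from the environment, following the list above. `select_profile` is a hypothetical helper, not the actual build_store code, and the precedence of VOYAGE_API_KEY over MEMEVAL_LOCAL_ANN when both are set is an assumption.

```python
import os
from typing import Mapping, Optional


def select_profile(env: Mapping[str, str] = os.environ,
                   forced: Optional[str] = None) -> str:
    """Return a routing profile name (hypothetical helper)."""
    if forced is not None:
        # "speed" is described as selected only when explicitly forced.
        return forced
    if env.get("VOYAGE_API_KEY"):
        return "accuracy"
    if env.get("MEMEVAL_LOCAL_ANN") == "1":
        return "accuracy-local"
    return "fusion"  # offline default
```

Passing the environment as a mapping keeps the function easy to test without mutating `os.environ`.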
The router owns both read and write orchestration: dedup-on-write and write-routing (the markdown base is always persisted), plus routed reads with optional cascade, fusion, and reranking.
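For the fusion step, Reciprocal Rank Fusion scores each item by summing 1 / (k + rank) over every ranked list it appears in. The sketch below assumes each backend returns an ordered list of memory ids; `rrf_merge` is an illustrative name, and k = 60 is the constant commonly used for RRF, not a value confirmed by this system.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_merge(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Merge several ranked id lists with Reciprocal Rank Fusion."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is stable.
    return sorted(scores, key=lambda item_id: (-scores[item_id], item_id))
```

RRF only needs ranks, not comparable scores, which is why it suits merging results from keyword, vector, and graph backends whose raw scores are on different scales.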

Fast retrieval

Indexing per backend

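Each store keeps its own index, so the router's chosen backend returns candidates in milliseconds from an index lookup rather than by re-reading every note. The default vector path is the one exception: it scores every stored embedding with brute-force cosine unless the sqlite-vec ANN index is enabled.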

SqliteVectorStore

Vector search

Embeds on write into memory.db. The stdlib default uses a char-n-gram hashing embedder with brute-force cosine; the opt-in accuracy-local path swaps in MiniLM embeddings and a sqlite-vec ANN index with exact rerank, and accuracy uses Voyage embeddings.
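A minimal sketch of a char-n-gram hashing embedder with brute-force cosine, assuming trigrams and a 256-dimension vector. Both parameters, the blake2b hash, and the function names are assumptions; the real embedder's settings are not stated here.

```python
import hashlib
import math
from typing import Dict, List, Tuple

DIM = 256  # assumed vector size
N = 3      # assumed n-gram length


def embed(text: str, dim: int = DIM, n: int = N) -> List[float]:
    """Hash each character n-gram into a bucket, then L2-normalize."""
    vec = [0.0] * dim
    padded = f" {text.lower()} "
    for i in range(len(padded) - n + 1):
        gram = padded[i:i + n].encode("utf-8")
        # A stable hash: the built-in hash() is randomized per process.
        digest = hashlib.blake2b(gram, digest_size=4).digest()
        vec[int.from_bytes(digest, "big") % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a: List[float], b: List[float]) -> float:
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def search(query: str, docs: Dict[str, str], top_k: int = 3) -> List[Tuple[float, str]]:
    """Brute force: score every document against the query."""
    q = embed(query)
    scored = [(cosine(q, embed(text)), doc_id) for doc_id, text in docs.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:top_k]
```

In a real store the document vectors would be computed once on write and read back from memory.db; this sketch re-embeds on every search only to stay short.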

MarkdownStore

Inverted keyword index

OKF-native markdown notes plus a keyword → file inverted index for literal recall. This is the always-written source-of-truth base, so a memory survives even if the other indexes are rebuilt.
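A keyword → file inverted index can be sketched as a dictionary of posting sets. The class name, the tokenizer, and the AND semantics for multi-word queries are assumptions made for illustration.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set

TOKEN = re.compile(r"[A-Za-z0-9_]+")  # assumed tokenizer; keeps identifiers whole


class InvertedIndex:
    """Map each keyword to the set of note files that contain it."""

    def __init__(self) -> None:
        self._postings: Dict[str, Set[str]] = defaultdict(set)

    def add(self, path: str, text: str) -> None:
        for token in TOKEN.findall(text.lower()):
            self._postings[token].add(path)

    def lookup(self, query: str) -> List[str]:
        tokens = TOKEN.findall(query.lower())
        if not tokens:
            return []
        # AND semantics: a file must contain every query token.
        hits = [self._postings.get(token, set()) for token in tokens]
        return sorted(set.intersection(*hits))
```

Because the index is derived entirely from the note text, it can be rebuilt from the markdown files at any time, which fits the source-of-truth role described above.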

GraphStore

Typed traversal

An OKF-link graph persisted to graph.db: seed from the query, then traverse typed edges bounded by depth. The paid path swaps in a Neo4j-backed store behind the same interface via a uri= seam.
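Seed-then-traverse with a depth bound is, in its simplest form, a breadth-first search that only follows allowed edge types. The adjacency-list shape, the edge type names, and the `traverse` signature below are assumptions; the graph.db schema is not shown in this section.

```python
from collections import deque
from typing import Dict, Iterable, List, Optional, Set, Tuple

Edge = Tuple[str, str]  # (edge_type, target_node)


def traverse(graph: Dict[str, List[Edge]],
             seeds: Iterable[str],
             edge_types: Optional[Set[str]] = None,
             max_depth: int = 2) -> List[str]:
    """Breadth-first walk from the seeds, bounded by depth and edge type."""
    order = list(dict.fromkeys(seeds))  # de-duplicate, keep seed order
    seen = set(order)
    queue = deque((node, 0) for node in order)
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth bound
        for edge_type, target in graph.get(node, []):
            if edge_types is not None and edge_type not in edge_types:
                continue
            if target not in seen:
                seen.add(target)
                order.append(target)
                queue.append((target, depth + 1))
    return order
```

The depth bound keeps a "what breaks" query from pulling in the whole graph, and the edge-type filter is what makes the traversal typed.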

Latest decisions · 2026-06-25

How it runs & consolidates

The whiteboard view: a Conscious band (the live, in-loop session) over a Subconscious band (async consolidation), with the plugin as the only surface the coding harness sees. Full contract in architecture.md §7.

[Figure: Cookbook Memory Conscious / Subconscious system diagram]

① Benchmarks run via the local CLI

Benchmarks run through the Claude Code CLI (subscription auth: API keys stripped, no API billing) and compare Claude Code's builtin memory against the shipping cookbook-memory plugin (plugin-real mode). Entry point: python -m memeval.claudecode.run_bench. The in-scope benchmarks are VISTA and SWE-Bench-CL.

Each run saves a per-benchmark, versioned file: results/{vX.Y}/{bench-name}-{timestamp}.json. vX.Y is the memory-system version (MEMORY_VERSION, which starts at v0.1 and is bumped by 0.1 per memory change + run).
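The path template above could be built as follows. The `results_path` helper, the UTC compact timestamp format, and the example bench name are assumptions; only the results/{vX.Y}/{bench-name}-{timestamp}.json layout comes from the text.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def results_path(version: str, bench_name: str,
                 now: Optional[datetime] = None,
                 root: Path = Path("results")) -> Path:
    """Build results/{vX.Y}/{bench-name}-{timestamp}.json (hypothetical helper)."""
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%dT%H%M%SZ")  # assumed timestamp format
    return root / version / f"{bench_name}-{stamp}.json"
```

Keying the directory on the memory-system version keeps runs against different memory versions from overwriting or mixing with each other.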

② Plugin + Stop hook → Daydream

The memory system ships as a Claude Code plugin (skills · MCP · hooks). A Stop hook fires the Daydream component when a session ends, so per-session consolidation runs automatically, with no manual trigger.

Two memory-creation paths: the model's in-loop remember tool, and the Daydream pass mining the logs for what wasn't saved. Fail-open throughout.
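Fail-open here means a memory failure must never break the session it is attached to. One way to express that is a decorator that logs the error and returns a safe default; the names below (`fail_open`, `extract_memories`, the logger name) are hypothetical and stand in for whatever the hook actually calls.

```python
import functools
import logging
from typing import Any, Callable

log = logging.getLogger("daydream.hook")  # hypothetical logger name


def fail_open(default: Any = None) -> Callable:
    """Wrap a memory step so any exception is logged and swallowed."""
    def decorator(fn: Callable) -> Callable:
        @functools.wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            try:
                return fn(*args, **kwargs)
            except Exception:
                log.exception("memory step %s failed; continuing", fn.__name__)
                return default
        return wrapper
    return decorator


@fail_open(default=[])
def extract_memories(transcript: str) -> list:
    # Stand-in for the real extraction call, which may raise.
    if not transcript:
        raise ValueError("empty transcript")
    return [transcript.strip()]
```

The cost of this pattern is that failures are silent to the user, so the log line is the only trace that a memory was not written.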

③ Daydream consolidation flow

  • Read the session .jsonl logs.
  • Track processed-vs-delta state; filter noise, then send only the unprocessed delta.
  • Call an OpenRouter model (cheap, non-frontier; defaults to ling-2.6-flash) to decide what to remember.
  • Write each extracted memory through the RouterStore — classify · route · dedup · version.
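The processed-vs-delta step in the flow above can be sketched by storing a byte offset per log file and reading only what was appended since. The state-file layout, the `read_delta` name, and the record `type` values used as the noise filter are assumptions, not the actual daydream-cli format.

```python
import json
from pathlib import Path
from typing import List

KEEP_TYPES = {"user", "assistant"}  # assumed noise filter on a "type" field


def read_delta(log_path: Path, state_path: Path) -> List[dict]:
    """Return only the unprocessed, non-noise records of a .jsonl log."""
    state = json.loads(state_path.read_text()) if state_path.exists() else {}
    offset = state.get(str(log_path), 0)
    records: List[dict] = []
    with log_path.open("rb") as fh:
        fh.seek(offset)
        for raw in fh:
            if not raw.endswith(b"\n"):
                break  # partial line still being written; retry next time
            offset += len(raw)
            try:
                record = json.loads(raw)
            except json.JSONDecodeError:
                continue  # skip blank or corrupt lines but still advance
            if isinstance(record, dict) and record.get("type") in KEEP_TYPES:
                records.append(record)
    state[str(log_path)] = offset
    state_path.write_text(json.dumps(state))
    return records
```

Only the returned records would then be sent to the extraction model, so a long session is never re-sent in full after its first pass.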