Collaborate · Cookbook Memory

How we branch

GitHub Flow, short-lived branches

main is protected — no direct pushes. Everything lands through a small, reviewed PR.

The loop

Branch from main: git switch -c <area>/<short-desc>
Commit small, one concern per branch (< ~400 lines).
Sync daily: git fetch && git rebase origin/main.
Push & open a PR early; CI runs as a signal (it doesn't gate the merge).
Squash-merge when you're ready — any collaborator can merge; delete the branch.

Rules that prevent conflicts

One owner per directory — edit only paths you own.
Short-lived branches (1–2 days); long branches rot.
Stubs first — the interface lands before dependents.
Don't reformat files you don't own.
The frozen contract (schema.py/protocols.py) changes only via a [CONTRACT] PR.

Who owns what

Your branches & directories

Git lets anyone create any branch, so we rely on a naming convention plus CODEOWNERS: you work in branches prefixed by your area, and the directories you own determine which code owner is auto-requested on a PR. Your row is marked "you".

Developer	Owns (directories)	Branch prefixes
Ken — you	`eval/memeval/loaders/`, `metrics.py`, `cost.py`, `trajectory.py`, `agent.py`, `tracing.py`, `results.py`, `claudecode/` (local Claude Code CLI benchmark runner), `eval/tests/`, the site (`*.html`, `assets/`)	`eval/` · `loaders/` · `metrics/` · `eval-infra/` · `site/*`
Keith	`eval/memeval/harness.py`, `models.py`, `cli.py`, `opencode/` (OpenCode memory framework — wraps the agent loop around Brent's store/router + Scott's dreaming)	`harness/` · `opencode/`
Brent	`eval/memeval/stores/`, `router.py` (scaffolded — stubs to implement)	`stores/` · `router/`
Scott B.	`eval/memeval/dreaming/` (scaffolded — stubs to implement)	`dreaming/*`
All four	`schema.py`, `protocols.py` (frozen) + the contract docs	`[CONTRACT] …` (all owners approve)

List your open branches: git branch -a --list 'eval/*' 'loaders/*'. Full map: .github/CODEOWNERS.
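As a rough illustration of the naming convention, the branch prefixes in the table can be turned into a lookup. This is only a sketch: `PREFIX_OWNER` and `area_owner` are hypothetical names, not part of the repo, and .github/CODEOWNERS remains the authoritative map.

```python
from typing import Optional

# Branch prefix -> owner, transcribed from the ownership table above.
PREFIX_OWNER = {
    "eval": "Ken", "loaders": "Ken", "metrics": "Ken",
    "eval-infra": "Ken", "site": "Ken",
    "harness": "Keith", "opencode": "Keith",
    "stores": "Brent", "router": "Brent",
    "dreaming": "Scott B.",
}


def area_owner(branch: str) -> Optional[str]:
    """Return the owner of a branch named <area>/<short-desc>, else None."""
    area, sep, desc = branch.partition("/")
    if not sep or not desc:
        return None  # not in <area>/<short-desc> form
    return PREFIX_OWNER.get(area)  # None for an unknown area
```

For example, `area_owner("eval/add-locomo-loader")` resolves to Ken, while a bare `main` or an unknown prefix resolves to nothing.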

Run the benchmark pipeline

Datasets → OpenCode memory harness → metrics

The eval pipeline talks to memory only through the MemoryStore seam. Offline it uses the reference store; for real runs you pass Keith's memory harness as store=, and OpenCode drives the multi-step loop via the AgentAdapter seam.
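To make the idea of a seam concrete, here is a minimal sketch using `typing.Protocol`. The method names `retrieve` and `remember` are borrowed from the `ctx` calls in step 3 below; their signatures, and the toy `ListStore`, are assumptions for illustration and not the contents of protocols.py.

```python
from typing import List, Protocol


class MemoryStore(Protocol):
    """Hypothetical shape of the seam; the real one lives in protocols.py."""

    def remember(self, text: str) -> None: ...

    def retrieve(self, query: str, k: int = 3) -> List[str]: ...


class ListStore:
    """Toy in-memory store that ranks notes by shared lowercase words."""

    def __init__(self) -> None:
        self._notes: List[str] = []

    def remember(self, text: str) -> None:
        self._notes.append(text)

    def retrieve(self, query: str, k: int = 3) -> List[str]:
        words = set(query.lower().split())
        scored = [(len(words & set(n.lower().split())), n) for n in self._notes]
        # Keep only notes that share at least one word, best match first.
        ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: -s[0])
        return [note for _, note in ranked[:k]]


def answer(question: str, store: MemoryStore) -> List[str]:
    """The caller sees only the seam, never the concrete backend."""
    return store.retrieve(question)
```

Because `answer` depends only on the protocol, swapping the toy store for a real harness would not change the calling code, which is the point of passing the store in from outside.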

1 · Set up & smoke-test (offline, zero deps)

# from the eval/ directory
pip install -e ".[claudecode,hf]"
python tests/test_smoke.py                 # offline checks, stdlib-only
python -m memeval.cli run --benchmark longmemeval --model echo \
    --no-memory --path tests/fixtures/longmemeval.json

2 · Run the five benchmarks via the Claude Code CLI (built-in vs our plugin memory)

This is the primary local path: it drives the Claude Code CLI on your subscription (API keys stripped, no API billing) and compares Claude Code's built-in CLAUDE.md memory against our OKF-backed plugin memory. It is cross-platform (macOS · Linux · Windows · WSL).

# from the eval/ directory · subscription auth, no API key
python -m memeval.claudecode.run_bench --benchmark all --mode all \
    --model claude-haiku-4-5 --results ../results.json

Each run appends to the aggregate ledger and saves a per-benchmark, versioned file at results/{vX.Y}/{bench-name}-{timestamp}.json. vX.Y is the memory-system version: it starts at v0.1 and is bumped by 0.1 per memory change + run. Per-developer guide: eval/memeval/claudecode/README.md.
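The file layout can be sketched as below. The timestamp format and the rule that the minor part is an integer counter (so v0.9 would be followed by v0.10) are assumptions. `bump_version` and `results_path` are hypothetical helpers; the runner's own naming code is the reference.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def bump_version(version: str) -> str:
    """v0.1 -> v0.2. Assumes the minor part is an integer counter."""
    major, minor = version.lstrip("v").split(".")
    return f"v{major}.{int(minor) + 1}"


def results_path(version: str, bench: str, when: datetime) -> PurePosixPath:
    """Build results/{vX.Y}/{bench-name}-{timestamp}.json for one run."""
    # Assumed timestamp format: compact UTC, e.g. 20250102T030405Z.
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return PurePosixPath("results") / version / f"{bench}-{stamp}.json"
```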

3 · Run a dataset through the OpenCode memory harness (real code)

from memeval.agent import run_agent, AgentResult
from memeval.results import append_result
from memeval.schema import Benchmark
from cookbook_memory import CookbookMemory   # Keith's harness: implements MemoryStore

MAX_STEPS = 8                                 # placeholder step cap; tune per benchmark
# next_subquery, build_prompt, summary and tests_pass are placeholders for your own helpers.

class OpenCodeAgent:                          # satisfies AgentAdapter
    name = "opencode+claude-haiku-4-5"
    price_in, price_out = 0.80, 4.0           # $/Mtok (verify)
    def solve(self, task, ctx):
        # OpenCode runs its loop; every step touches the SHARED store:
        patch, solved = None, False
        for _ in range(MAX_STEPS):
            subquery = next_subquery(task, patch)         # what to look up this step
            hits = ctx.retrieve(subquery)                 # -> Keith's memory harness
            patch = ctx.generate(build_prompt(task, hits))
            ctx.remember(f"tried: {summary(patch)}")      # written to the shared store
            solved = tests_pass(patch)
            if solved:
                break
        return AgentResult(prediction=patch, patch=patch, success=solved)

store = CookbookMemory(backends=["markdown", "sqlite", "graph"])   # the real harness
rr = run_agent(Benchmark.SWE_BENCH_CL, OpenCodeAgent(), memory=True,
               store=store, path_or_id="data/swe_bench_cl.json")
append_result(rr, "results.json", run_id="brent-swecl-haiku-mem")   # -> Results page

The store= argument is the integration point: pass any MemoryStore (Keith's harness, or a single backend). The memory-off baseline is the same code with memory=False. Coding solve-rate needs a CODE grader passed as grader= (apply patch + run tests); until one is supplied, CODE accuracy is left ungraded.

4 · Log the run so it shows on the Results page

# one command: run + append to the ledger the Results page renders
python -m memeval.results run --benchmark longmemeval --model claude-haiku-4-5 \
    --memory --results ../results.json --out runs/lme_haiku_mem.jsonl
git add ../results.json && git commit -m "results: longmemeval haiku+mem" # via PR

Then it appears on the Results page after merge. Each captain logs their benchmark on their own API key (see the plan).

5 · Or launch it from GitHub Actions (no local setup)

Any collaborator with write access can run a benchmark from the browser: Actions → "Benchmark run" → Run workflow. Defaults are free + offline (EchoModel on the bundled fixture, no secrets); choose a real model + dataset (and a budget) for a paid run.

Inputs: benchmark · model · memory on/off · dataset (blank = fixture) · dev-slice / limit · budget_usd · open-PR.
It uploads the trajectory + results.json as artifacts and (by default) opens a PR adding the run to the Results page — review & merge to publish.
budget_usd defaults to $200 per run; set 0 to disable the cap. echo runs are free and need no key.
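The budget cap amounts to simple arithmetic over token counts and per-million-token prices. The guard below is a sketch under stated assumptions: `BudgetGuard` and `BudgetExceeded` are hypothetical names, the workflow's actual enforcement may differ, and the prices in the usage note are the unverified figures from the step 3 snippet.

```python
class BudgetExceeded(RuntimeError):
    """Raised when accumulated spend passes the cap."""


class BudgetGuard:
    def __init__(self, budget_usd: float = 200.0,
                 price_in: float = 0.0, price_out: float = 0.0) -> None:
        self.budget_usd = budget_usd  # 0 disables the cap
        self.price_in = price_in      # $ per million input tokens
        self.price_out = price_out    # $ per million output tokens
        self.spent_usd = 0.0

    def charge(self, tokens_in: int, tokens_out: int) -> float:
        """Add one call's cost and return the running total in USD."""
        cost = (tokens_in * self.price_in + tokens_out * self.price_out) / 1_000_000
        self.spent_usd += cost
        if self.budget_usd > 0 and self.spent_usd > self.budget_usd:
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.2f} of ${self.budget_usd:.2f}"
            )
        return self.spent_usd
```

With prices of 0.80 and 4.0 $/Mtok, a call using 500,000 input and 100,000 output tokens costs about $0.80. An echo run has both prices at zero and never trips the guard.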

Bring your own key — there is no shared key

Paid models run on your own Anthropic key, selected by who launched the workflow. The launcher's key is read from the repo secret ANTHROPIC_API_KEY_<YOUR_HANDLE> (uppercased, e.g. ANTHROPIC_API_KEY_KMAZANEC). There is no fallback to anyone else's key — Ken's key is bound to @kenhuangus and cannot be spent by another collaborator. If no key is found for a paid run, the job fails fast with instructions.
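The lookup rule (handle uppercased, no fallback, fail fast) can be sketched as follows. `secret_name_for` and `resolve_key` are hypothetical helpers, and the `env` mapping stands in for however the workflow exposes secrets to the job. How handles containing hyphens are treated is not stated here, so the sketch does not guess.

```python
from typing import Mapping


def secret_name_for(handle: str) -> str:
    """Map a GitHub handle to its secret name, e.g. kmazanec -> ..._KMAZANEC."""
    return "ANTHROPIC_API_KEY_" + handle.lstrip("@").upper()


def resolve_key(handle: str, env: Mapping[str, str]) -> str:
    """Return the launcher's own key or fail fast; never fall back to another key."""
    name = secret_name_for(handle)
    key = env.get(name, "")
    if not key:
        raise RuntimeError(
            f"No key found in {name}: add it as a repo secret or run in your fork."
        )
    return key
```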

Two ways to provide yours:

Admin adds a repo secret (run in this repo). A repo admin opens Settings → Secrets and variables → Actions → New repository secret, names it ANTHROPIC_API_KEY_<YOUR_HANDLE>, and pastes your key. Then launch from the Actions tab as normal.
Run in your own fork (no admin needed). Fork the repo, add ANTHROPIC_API_KEY_<YOUR_HANDLE> under your fork's Settings → Secrets and variables → Actions, and run Benchmark run there. Open a PR back to publish results.

Keys are never printed in logs. Get a key at console.anthropic.com → API Keys. This key path is only for the GitHub Actions / OpenCode adapter paid runs above. The local Claude Code CLI runner (step 2) is the opposite: it strips ANTHROPIC_API_KEY/ANTHROPIC_AUTH_TOKEN from every invocation and runs on your Claude Code subscription — no API key, no API billing.
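Stripping the two variables for the local runner can be pictured as building a filtered environment for the child process. This is a minimal sketch, not the runner's actual code, and `subscription_env` is a hypothetical name.

```python
import os
from typing import Dict, Mapping, Optional

# The two variables the local runner removes, as named above.
STRIPPED = ("ANTHROPIC_API_KEY", "ANTHROPIC_AUTH_TOKEN")


def subscription_env(base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Copy the environment without API credentials, leaving the rest intact."""
    source = os.environ if base is None else base
    return {k: v for k, v in source.items() if k not in STRIPPED}
```

The result can be passed as `env=` to `subprocess.run`, so the child process never sees an API key even if one is exported in your shell.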

Ship it

Pull requests & merge

Normal PR

git push -u origin eval/add-locomo-loader
gh pr create — fill the template (scope, contract impact).
CI runs the test check and the directory's code owner is auto-requested — both as signals, neither blocks the merge.
Review is encouraged but not required; every PR is eligible to merge by default.
Squash-merge whenever you're ready — any collaborator can merge; branch auto-deletes.

Changing the frozen contract

Title the PR [CONTRACT] ….
Edit schema.py/protocols.py and architecture.md together.
Fill the "affected dependents" table.
By team convention, loop in all four owners (CODEOWNERS auto-requests them) — a courtesy for contract changes, not a hard gate.
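A local pre-push check for the title convention might look like the sketch below. It matches the frozen files by name only, since their exact directory is not stated here, and it is advisory in the same spirit as the rest of the workflow. Both function names are hypothetical.

```python
from pathlib import PurePosixPath
from typing import Iterable, List

FROZEN = {"schema.py", "protocols.py"}


def frozen_files_touched(changed_paths: Iterable[str]) -> List[str]:
    """Return the changed paths whose file name is one of the frozen files."""
    return sorted(p for p in changed_paths if PurePosixPath(p).name in FROZEN)


def title_matches_convention(title: str, changed_paths: Iterable[str]) -> bool:
    """True unless a frozen file changed and the title lacks the [CONTRACT] prefix."""
    touched = frozen_files_touched(changed_paths)
    return not touched or title.startswith("[CONTRACT]")
```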

Branch protection is intentionally non-blocking: a PR is the workflow, but no approvals or passing checks are required to merge — every PR is eligible by default and any collaborator can merge. CI + code-owner requests still run as signals.

Status: all three collaborators — Keith @kmazanec, Brent @bgibson1618, Scott B. @NerdAlert58 — have accepted and hold write access, so each can merge PRs. Branch protection is non-blocking by design: no required reviews or status checks, so every PR is mergeable by default. CODEOWNERS still auto-requests the right reviewer as a courtesy.

Optional

Langfuse tracing for the agent loop

Verdict: optional, lazy add-on — keep the Trajectory JSONL + metrics as the source of truth; use Langfuse as a mirror to inspect multi-step agent runs and the dreaming loop. It's a no-op unless installed and keyed, so the offline/CI path is untouched.

Turn it on

pip install langfuse
export LANGFUSE_PUBLIC_KEY=pk-...
export LANGFUSE_SECRET_KEY=sk-...
# then run_agent emits a trace per task automatically

memeval/tracing.py wires it in: one span per task, nested retrieve / generate / write steps, the four metrics as scores.
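The "no-op unless installed and keyed" behaviour can be approximated with a guard like the one below. It is a sketch of the idea, not the code in memeval/tracing.py: `tracing_enabled` is a hypothetical name, and the check only looks for the package and the two environment variables from the setup snippet above.

```python
import importlib.util
import os
from typing import Mapping, Optional

REQUIRED_VARS = ("LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY")


def tracing_enabled(env: Optional[Mapping[str, str]] = None,
                    package: str = "langfuse") -> bool:
    """True only if the package is importable and both keys are set."""
    env = os.environ if env is None else env
    installed = importlib.util.find_spec(package) is not None
    keyed = all(env.get(name) for name in REQUIRED_VARS)
    return installed and keyed
```

A caller would check this once at startup and skip all tracing work when it returns False, which keeps the offline and CI paths free of the dependency.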

Why bother

Nested trace tree per task (vs. a flat JSONL dump).
Per-step latency + retrieved items inline — find the bad retrieval.
Experiment diff: memory-on vs off, Haiku vs Opus, side by side.
Debug the dreaming loop's merge/dedup decisions step by step.