Four developers, one repo, no conflicts. The three contracts (prd.md · architecture.md · plan.md) say what/how/who; this page is the day-to-day workflow.
main is protected — no direct pushes. Everything lands through a small, reviewed PR.
- Branch from main: git switch -c <area>/<short-desc>
- Rebase onto the latest main: git fetch && git rebase origin/main.
- The frozen contract (schema.py/protocols.py) changes only via a [CONTRACT] PR.

Git lets anyone make any branch, so we use a naming convention + CODEOWNERS:
you work in branches prefixed by your area, and the directories you own gate the PR. Your rows are highlighted.
| Developer | Owns (directories) | Branch prefixes |
|---|---|---|
| Ken (you) | eval/memeval/loaders/, metrics.py, cost.py, trajectory.py, agent.py, tracing.py, results.py, claudecode/ (local Claude Code CLI benchmark runner), eval/tests/, the site (*.html, assets/) | eval/* · loaders/* · metrics/* · eval-infra/* · site/* |
| Keith | eval/memeval/harness.py, models.py, cli.py, opencode/ (OpenCode memory framework: wraps the agent loop around Brent's store/router + Scott's dreaming) | harness/* · opencode/* |
| Brent | eval/memeval/stores/, router.py (scaffolded, stubs to implement) | stores/* · router/* |
| Scott B. | eval/memeval/dreaming/ (scaffolded, stubs to implement) | dreaming/* |
| All four | schema.py, protocols.py (frozen) + the contract docs | [CONTRACT] … (all owners approve) |
List your open branches: git branch -a --list 'eval/*' 'loaders/*'.
Full map: .github/CODEOWNERS.
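The naming convention can be checked mechanically. The sketch below copies the prefix-to-owner mapping from the table above; the function name branch_owner and the dictionary are illustrative only and are not part of the repo, and .github/CODEOWNERS remains the authoritative map.

```python
from typing import Optional

# Branch prefix -> owner, copied from the ownership table above.
# Hypothetical helper for illustration; not part of the repo.
BRANCH_OWNERS = {
    "eval": "Ken", "loaders": "Ken", "metrics": "Ken",
    "eval-infra": "Ken", "site": "Ken",
    "harness": "Keith", "opencode": "Keith",
    "stores": "Brent", "router": "Brent",
    "dreaming": "Scott B.",
}


def branch_owner(branch: str) -> Optional[str]:
    """Return the owner implied by a branch's area prefix, or None if unknown."""
    area, sep, rest = branch.partition("/")
    if not sep or not rest:
        return None  # not of the form <area>/<short-desc>
    return BRANCH_OWNERS.get(area)


print(branch_owner("eval/add-locomo-loader"))
```

A branch such as main, or one with an unlisted prefix, returns None, which is the cue to pick one of your own prefixes before pushing.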
The eval pipeline talks to memory only through the MemoryStore seam. Offline it uses the
reference store; for real runs you pass Keith's memory harness as store=, and
OpenCode drives the multi-step loop via the AgentAdapter seam.
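To make the seam idea concrete, here is a toy store that the pipeline could call without knowing what sits behind it. The method names retrieve and remember mirror the calls shown on this page, but their signatures here are assumptions; the real MemoryStore definition lives in protocols.py and may differ. MemoryStoreLike, ListStore and answer_with_memory are hypothetical names.

```python
from typing import List, Protocol


class MemoryStoreLike(Protocol):
    """Assumed shape of the seam; the real one is defined in protocols.py."""

    def retrieve(self, query: str) -> List[str]: ...
    def remember(self, text: str) -> None: ...


class ListStore:
    """Toy in-memory store: keeps notes in a list, retrieves by word overlap."""

    def __init__(self) -> None:
        self.notes: List[str] = []

    def remember(self, text: str) -> None:
        self.notes.append(text)

    def retrieve(self, query: str) -> List[str]:
        words = set(query.lower().split())
        return [n for n in self.notes if words & set(n.lower().split())]


def answer_with_memory(store: MemoryStoreLike, question: str) -> str:
    """Caller code only touches the two seam methods, never the backend."""
    hits = store.retrieve(question)
    return hits[0] if hits else "unknown"
```

Because the caller depends only on the two methods, swapping the toy store for a real harness should not require changes on the eval side.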
# from the eval/ directory
pip install -e ".[claudecode,hf]"
python tests/test_smoke.py   # offline checks, stdlib-only
python -m memeval.cli run --benchmark longmemeval --model echo \
  --no-memory --path tests/fixtures/longmemeval.json
The primary local path — drives the Claude Code CLI on your subscription
(API keys stripped, no API billing), comparing Claude Code's built-in CLAUDE.md memory against
our OKF-backed plugin memory. Cross-platform (macOS · Linux · Windows · WSL).
# from the eval/ directory · subscription auth, no API key
python -m memeval.claudecode.run_bench --benchmark all --mode all \
--model claude-haiku-4-5 --results ../results.json
Each run appends to the aggregate ledger and saves a per-benchmark, versioned file
results/{vX.Y}/{bench-name}-{timestamp}.json (vX.Y is the memory-system version: it
starts at v0.1 and is bumped by 0.1 for each memory change + run). Per-developer guide:
eval/memeval/claudecode/README.md.
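The path layout above can be sketched as follows. The timestamp format and both function names are assumptions made for illustration; the runner's actual format is not stated here. The bump treats the minor number as a counter, so v0.9 becomes v0.10, and the repo may roll over differently.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def bump_version(version: str) -> str:
    """'v0.1' -> 'v0.2'. Integer arithmetic avoids float rounding on 0.1 steps."""
    major, minor = version.lstrip("v").split(".")
    return f"v{major}.{int(minor) + 1}"


def result_path(version: str, bench_name: str, when: datetime) -> PurePosixPath:
    """Build results/{vX.Y}/{bench-name}-{timestamp}.json."""
    stamp = when.strftime("%Y%m%dT%H%M%SZ")  # assumed timestamp format
    return PurePosixPath("results") / version / f"{bench_name}-{stamp}.json"


now = datetime.now(timezone.utc)
print(result_path(bump_version("v0.1"), "longmemeval", now))
```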
from memeval.agent import run_agent, AgentResult
from memeval.results import append_result
from memeval.schema import Benchmark
from cookbook_memory import CookbookMemory  # Keith's harness: implements MemoryStore

# MAX_STEPS, subquery, build_prompt, summary and tests_pass are placeholders
# for your own loop logic; they are not defined here.

class OpenCodeAgent:  # satisfies AgentAdapter
    name = "opencode+claude-haiku-4-5"
    price_in, price_out = 0.80, 4.0  # $/Mtok (verify)

    def solve(self, task, ctx):
        patch = None
        # OpenCode runs its loop; every step touches the SHARED store:
        for _ in range(MAX_STEPS):
            hits = ctx.retrieve(subquery)  # -> Keith's memory harness
            patch = ctx.generate(build_prompt(task, hits))
            ctx.remember(f"tried: {summary(patch)}")  # written to the shared store
            if tests_pass(patch):
                break
        return AgentResult(prediction=patch, patch=patch, success=tests_pass(patch))

store = CookbookMemory(backends=["markdown", "sqlite", "graph"])  # the real harness
rr = run_agent(Benchmark.SWE_BENCH_CL, OpenCodeAgent(), memory=True,
               store=store, path_or_id="data/swe_bench_cl.json")
append_result(rr, "results.json", run_id="brent-swecl-haiku-mem")  # -> Results page
The store= is the integration point: pass any MemoryStore (Keith's harness, or
a single backend). Memory-off baseline = same code with memory=False. Coding solve-rate needs a
CODE grader= (apply patch + run tests) — until then CODE accuracy is left ungraded.
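The price_in / price_out attributes on the adapter are quoted per million tokens, so a run's cost follows from its token counts. The helper below is a sketch of that arithmetic only; run_cost_usd is a hypothetical name, and how cost.py actually computes cost is not shown on this page.

```python
def run_cost_usd(tokens_in: int, tokens_out: int,
                 price_in: float, price_out: float) -> float:
    """Dollar cost of a run; prices are in $ per million tokens."""
    return (tokens_in * price_in + tokens_out * price_out) / 1_000_000


# Example with the (unverified) prices from the adapter above.
print(round(run_cost_usd(250_000, 40_000, 0.80, 4.0), 4))
```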
# one command: run + append to the ledger the Results page renders
python -m memeval.results run --benchmark longmemeval --model claude-haiku-4-5 \
  --memory --results ../results.json --out runs/lme_haiku_mem.jsonl
git add ../results.json && git commit -m "results: longmemeval haiku+mem"   # via PR
Then it appears on the Results page after merge. Each captain logs their benchmark on their own API key (see the plan).
Any collaborator with write access can run a benchmark from the browser: Actions → "Benchmark run" → Run workflow. Defaults are free + offline (EchoModel on the bundled fixture, no secrets); choose a real model + dataset (and a budget) for a paid run.
- The workflow uploads results.json as artifacts and (by default) opens a PR adding the run to the Results page; review & merge to publish.
- budget_usd defaults to $200 per run; set 0 to disable the cap. echo runs are free and need no key.

Paid models run on your own Anthropic key, selected by who launched the workflow. The
launcher's key is read from the repo secret ANTHROPIC_API_KEY_<YOUR_HANDLE>
(uppercased, e.g. ANTHROPIC_API_KEY_KMAZANEC). There is no fallback to anyone else's
key: Ken's key is bound to @kenhuangus and cannot be spent by another collaborator. If
no key is found for a paid run, the job fails fast with instructions.
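The lookup rule can be sketched like this: derive the secret name from the launcher's handle, skip the lookup for echo, and fail fast with no fallback. The function names and the error text are illustrative assumptions; the real workflow's implementation is not shown here, and handles containing characters that are not valid in secret names would need separate treatment.

```python
import os
from typing import Mapping, Optional


def secret_name(handle: str) -> str:
    """'kmazanec' or '@kmazanec' -> 'ANTHROPIC_API_KEY_KMAZANEC'."""
    return "ANTHROPIC_API_KEY_" + handle.lstrip("@").upper()


def resolve_key(handle: str, model: str,
                env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the launcher's own key, None for echo, or raise if it is missing."""
    env = os.environ if env is None else env
    if model == "echo":
        return None  # echo runs are free and need no key
    key = env.get(secret_name(handle))
    if not key:
        # No fallback to anyone else's key: stop before any paid call is made.
        raise RuntimeError(
            f"No key for paid run: add the secret {secret_name(handle)}"
        )
    return key
```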
Two ways to provide yours:
- In this repo: have ANTHROPIC_API_KEY_<YOUR_HANDLE> added as a repo secret with your key pasted in. Then launch from the Actions tab as normal.
- In your fork: add ANTHROPIC_API_KEY_<YOUR_HANDLE> under your fork's Settings → Secrets and variables → Actions, and run Benchmark run there. Open a PR back to publish results.

Keys are never printed in logs. Get a key at console.anthropic.com → API Keys. This
key path is only for the GitHub Actions / OpenCode adapter paid runs above. The local
Claude Code CLI runner (step 2) is the opposite: it strips
ANTHROPIC_API_KEY/ANTHROPIC_AUTH_TOKEN from every invocation and runs on your Claude
Code subscription, so there is no API key and no API billing.
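One way to strip those two variables is to build a filtered copy of the environment and hand it to the child process. This is a sketch of the idea, not the runner's actual code; subscription_env and STRIPPED are hypothetical names.

```python
import os
from typing import Dict, Mapping, Optional

# The two variables named above; removing them means the child process
# cannot pick up API credentials from the environment.
STRIPPED = ("ANTHROPIC_API_KEY", "ANTHROPIC_AUTH_TOKEN")


def subscription_env(base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Copy of the environment with the API credentials removed."""
    base = os.environ if base is None else base
    return {k: v for k, v in base.items() if k not in STRIPPED}


env = subscription_env({"ANTHROPIC_API_KEY": "sk-example", "PATH": "/usr/bin"})
print(sorted(env))
```

The returned dict can be passed as the env= argument of subprocess.run, which leaves the parent process's own environment unchanged.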
1. git push -u origin eval/add-locomo-loader
2. gh pr create, then fill the template (scope, contract impact).
3. The test check runs and the directory's code owner is auto-requested; both are signals, neither blocks the merge.
4. Contract change: title the PR [CONTRACT] … and change schema.py/protocols.py and architecture.md together.

Branch protection is intentionally non-blocking: a PR is the workflow, but no approvals or passing checks are required to merge. Every PR is eligible by default and any collaborator can merge. CI + code-owner requests still run as signals.
Status: all three collaborators (Keith @kmazanec,
Brent @bgibson1618, Scott B. @NerdAlert58) have accepted their invitations and hold write
access, so each can merge PRs. Branch protection is non-blocking by design: no required reviews or status checks,
so every PR is mergeable by default. CODEOWNERS still auto-requests the right reviewer as a courtesy.
Verdict: optional, lazy add-on — keep the Trajectory JSONL + metrics as the
source of truth; use Langfuse as a mirror to inspect multi-step agent runs and the dreaming loop.
It's a no-op unless installed and keyed, so the offline/CI path is untouched.
pip install langfuse
export LANGFUSE_PUBLIC_KEY=pk-...
export LANGFUSE_SECRET_KEY=sk-...
# then run_agent emits a trace per task automatically
memeval/tracing.py wires it in: one span per task, nested
retrieve / generate / write steps, the four metrics as scores.
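The "no-op unless installed and keyed" behaviour can be approximated with a guard like the one below. It only checks whether the package is importable and both environment variables are set; tracing_enabled is a hypothetical name, and the actual check in memeval/tracing.py may differ.

```python
import importlib.util
import os
from typing import Mapping, Optional


def tracing_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """True only if langfuse is importable and both keys are set."""
    env = os.environ if env is None else env
    installed = importlib.util.find_spec("langfuse") is not None
    keyed = bool(env.get("LANGFUSE_PUBLIC_KEY")) and bool(env.get("LANGFUSE_SECRET_KEY"))
    return installed and keyed


print(tracing_enabled({}))
```

With a guard of this kind, the offline and CI paths never import the tracing package at all, which matches the claim that they are untouched.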