Collaborate on GitHub

Four developers, one repo, no conflicts. The three contracts (prd.md · architecture.md · plan.md) say what/how/who; this page is the day-to-day workflow.

How we branch

GitHub Flow, short-lived branches

main is protected — no direct pushes. Everything lands through a small, reviewed PR.

The loop

  1. Branch from main: git switch -c <area>/<short-desc>
  2. Commit small, one concern per branch (< ~400 lines).
  3. Sync daily: git fetch && git rebase origin/main.
  4. Push & open a PR early; CI runs as a signal (it doesn't gate the merge).
  5. Squash-merge when you're ready — any collaborator can merge; delete the branch.
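
The "< ~400 lines" guideline in step 2 can be checked before you open a PR. This is a minimal sketch, assuming you pass it the text printed by git diff --numstat origin/main...HEAD; the function names and the default budget are illustrative and not part of the repo.

```python
def changed_lines(numstat: str) -> int:
    """Sum added + deleted lines from `git diff --numstat` output."""
    total = 0
    for row in numstat.splitlines():
        parts = row.split("\t")
        if len(parts) < 3:
            continue  # skip blank or malformed rows
        added, deleted = parts[0], parts[1]
        # Binary files are reported as "-<TAB>-<TAB>path" and carry no line counts.
        if added.isdigit() and deleted.isdigit():
            total += int(added) + int(deleted)
    return total


def within_budget(numstat: str, budget: int = 400) -> bool:
    """True when the diff stays under the (assumed) line budget."""
    return changed_lines(numstat) < budget


sample = (
    "12\t3\teval/memeval/metrics.py\n"
    "-\t-\tassets/logo.png\n"
    "40\t0\teval/tests/test_metrics.py\n"
)
print(changed_lines(sample), within_budget(sample))  # 55 True
```

Treat the result as a nudge to split the branch, in the same spirit as CI: a signal, not a gate.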

Rules that prevent conflicts

  • One owner per directory — edit only paths you own.
  • Short-lived branches (1–2 days); long branches rot.
  • Stubs first — the interface lands before dependents.
  • Don't reformat files you don't own.
  • The frozen contract (schema.py/protocols.py) changes only via a [CONTRACT] PR.

Who owns what

Your branches & directories

Git lets anyone make any branch, so we use a naming convention + CODEOWNERS: you work in branches prefixed by your area, and the directories you own decide who is auto-requested as reviewer on a PR (a signal, not a merge gate). Your rows are highlighted.

Developer | Owns (directories) | Branch prefixes
Ken (you) | eval/memeval/loaders/, metrics.py, cost.py, trajectory.py, agent.py, tracing.py, results.py, claudecode/ (local Claude Code CLI benchmark runner), eval/tests/, the site (*.html, assets/) | eval/* · loaders/* · metrics/* · eval-infra/* · site/*
Keith | eval/memeval/harness.py, models.py, cli.py, opencode/ (OpenCode memory framework: wraps the agent loop around Brent's store/router + Scott's dreaming) | harness/* · opencode/*
Brent | eval/memeval/stores/, router.py (scaffolded; stubs to implement) | stores/* · router/*
Scott B. | eval/memeval/dreaming/ (scaffolded; stubs to implement) | dreaming/*
All four | schema.py, protocols.py (frozen) + the contract docs | [CONTRACT] … (all owners approve)

List your open branches: git branch -a --list 'eval/*' 'loaders/*'. Full map: .github/CODEOWNERS.
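
The naming convention can be expressed as a small lookup. The sketch below copies the prefixes from the ownership table; the dictionary and the function name branch_owner are hypothetical helpers for illustration, and .github/CODEOWNERS remains the authoritative map.

```python
from typing import Optional

# Branch prefix -> owner, copied from the ownership table above.
PREFIX_OWNERS = {
    "eval": "Ken", "loaders": "Ken", "metrics": "Ken",
    "eval-infra": "Ken", "site": "Ken",
    "harness": "Keith", "opencode": "Keith",
    "stores": "Brent", "router": "Brent",
    "dreaming": "Scott B.",
}


def branch_owner(branch: str) -> Optional[str]:
    """Return the owner for an <area>/<short-desc> branch, or None if it does not fit."""
    prefix, sep, desc = branch.partition("/")
    if not sep or not desc:
        return None  # no "/" or empty description: not in <area>/<short-desc> form
    return PREFIX_OWNERS.get(prefix)


print(branch_owner("eval/add-locomo-loader"))  # Ken
print(branch_owner("main"))                    # None
```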

Run the benchmark pipeline

Datasets → OpenCode memory harness → metrics

The eval pipeline talks to memory only through the MemoryStore seam. Offline it uses the reference store; for real runs you pass Keith's memory harness as store=, and OpenCode drives the multi-step loop via the AgentAdapter seam.

1 · Set up & smoke-test (offline, zero deps)

# from the eval/ directory
pip install -e ".[claudecode,hf]"
python tests/test_smoke.py                 # offline checks, stdlib-only
python -m memeval.cli run --benchmark longmemeval --model echo \
    --no-memory --path tests/fixtures/longmemeval.json

2 · Run the five benchmarks via the Claude Code CLI (built-in vs our plugin memory)

The primary local path — drives the Claude Code CLI on your subscription (API keys stripped, no API billing), comparing Claude Code's built-in CLAUDE.md memory against our OKF-backed plugin memory. Cross-platform (macOS · Linux · Windows · WSL).

# from the eval/ directory · subscription auth, no API key
python -m memeval.claudecode.run_bench --benchmark all --mode all \
    --model claude-haiku-4-5 --results ../results.json

Each run appends to the aggregate ledger and also saves a per-benchmark, versioned file at results/{vX.Y}/{bench-name}-{timestamp}.json. vX.Y is the memory-system version: it starts at v0.1 and is bumped by 0.1 each time the memory system changes and is re-run. Per-developer guide: eval/memeval/claudecode/README.md.
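
To make the file layout concrete, here is a sketch of how a path of the form results/{vX.Y}/{bench-name}-{timestamp}.json could be assembled. The timestamp format and the function name result_path are assumptions; the runner may format the timestamp differently, so check eval/memeval/claudecode/README.md before relying on it.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def result_path(version: str, bench: str, when: datetime) -> PurePosixPath:
    """Build results/{vX.Y}/{bench-name}-{timestamp}.json for one run."""
    # Assumed timestamp format: compact UTC, e.g. 20250102T030405Z.
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return PurePosixPath("results") / version / f"{bench}-{stamp}.json"


run_time = datetime(2025, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
print(result_path("v0.1", "longmemeval", run_time))
# results/v0.1/longmemeval-20250102T030405Z.json
```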

3 · Run a dataset through the OpenCode memory harness (real code)

from memeval.agent import run_agent, AgentResult
from memeval.results import append_result
from memeval.schema import Benchmark
from cookbook_memory import CookbookMemory   # Keith's harness: implements MemoryStore

MAX_STEPS = 8                                 # step budget (illustrative value; tune it)

class OpenCodeAgent:                          # satisfies AgentAdapter
    name = "opencode+claude-haiku-4-5"
    price_in, price_out = 0.80, 4.0           # $/Mtok (verify)
    def solve(self, task, ctx):
        # subquery, build_prompt, summary and tests_pass are placeholders for your own helpers.
        patch = None                          # last candidate edit, returned after the loop
        # OpenCode runs its loop; every step touches the SHARED store:
        for _ in range(MAX_STEPS):
            hits = ctx.retrieve(subquery)             # -> Keith's memory harness
            patch = ctx.generate(build_prompt(task, hits))
            ctx.remember(f"tried: {summary(patch)}")    # written to the shared store
            if tests_pass(patch): break
        return AgentResult(prediction=patch, patch=patch, success=tests_pass(patch))

store = CookbookMemory(backends=["markdown", "sqlite", "graph"])   # the real harness
rr = run_agent(Benchmark.SWE_BENCH_CL, OpenCodeAgent(), memory=True,
               store=store, path_or_id="data/swe_bench_cl.json")
append_result(rr, "results.json", run_id="brent-swecl-haiku-mem")   # -> Results page

The store= argument is the integration point: pass any MemoryStore (Keith's harness, or a single backend). For the memory-off baseline, run the same code with memory=False. Coding solve-rate needs a CODE grader= (apply patch + run tests); until one is supplied, CODE accuracy is left ungraded.
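
For local experiments it can help to have a throwaway store to pass as store=. The class below is a toy sketch only: it assumes the MemoryStore protocol exposes remember and retrieve methods mirroring the ctx calls in the example above, which this page does not confirm. Check protocols.py for the real method names and signatures before using it.

```python
from typing import List


class ListStore:
    """Toy in-memory store. Method names are assumed, not taken from protocols.py."""

    def __init__(self) -> None:
        self._notes: List[str] = []

    def remember(self, text: str) -> None:
        """Append one note to the store."""
        self._notes.append(text)

    def retrieve(self, query: str, k: int = 3) -> List[str]:
        """Return up to k notes ranked by word overlap with the query."""
        terms = set(query.lower().split())
        scored = []
        for index, note in enumerate(self._notes):
            overlap = len(terms & set(note.lower().split()))
            if overlap > 0:
                scored.append((overlap, index, note))
        # Highest overlap first; ties keep insertion order.
        scored.sort(key=lambda item: (-item[0], item[1]))
        return [note for _, _, note in scored[:k]]


store = ListStore()
store.remember("tried: patch parser off-by-one")
store.remember("tried: bump timeout")
print(store.retrieve("parser patch"))  # ['tried: patch parser off-by-one']
```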

4 · Log the run so it shows on the Results page

# one command: run + append to the ledger the Results page renders
python -m memeval.results run --benchmark longmemeval --model claude-haiku-4-5 \
    --memory --results ../results.json --out runs/lme_haiku_mem.jsonl
git add ../results.json && git commit -m "results: longmemeval haiku+mem" # via PR

Then it appears on the Results page after merge. Each captain logs their benchmark on their own API key (see the plan).

5 · Or launch it from GitHub Actions (no local setup)

Any collaborator with write access can run a benchmark from the browser: Actions → "Benchmark run" → Run workflow. Defaults are free + offline (EchoModel on the bundled fixture, no secrets); choose a real model + dataset (and a budget) for a paid run.

Bring your own key — there is no shared key

Paid models run on your own Anthropic key, selected by who launched the workflow. The launcher's key is read from the repo secret ANTHROPIC_API_KEY_<YOUR_HANDLE> (uppercased, e.g. ANTHROPIC_API_KEY_KMAZANEC). There is no fallback to anyone else's key — Ken's key is bound to @kenhuangus and cannot be spent by another collaborator. If no key is found for a paid run, the job fails fast with instructions.
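
The lookup rule above (one secret per launcher, no fallback) can be sketched as follows. The function names are hypothetical, and the handling of characters other than letters, digits, and underscores is an assumption: GitHub secret names only allow those characters, but this page does not say how the workflow maps a handle that contains a hyphen.

```python
import re
from typing import Mapping


def secret_name(handle: str) -> str:
    """Map a GitHub handle to its per-user secret name, e.g. @kmazanec -> ..._KMAZANEC."""
    upper = handle.lstrip("@").upper()
    # Assumption: characters not valid in a secret name become underscores.
    return "ANTHROPIC_API_KEY_" + re.sub(r"[^A-Z0-9_]", "_", upper)


def resolve_key(handle: str, env: Mapping[str, str]) -> str:
    """Return the launcher's own key, or fail fast. Never falls back to another key."""
    name = secret_name(handle)
    key = env.get(name)
    if not key:
        raise RuntimeError(f"No key for paid run: add the secret {name} and relaunch.")
    return key


print(secret_name("@kmazanec"))  # ANTHROPIC_API_KEY_KMAZANEC
```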

Two ways to provide yours:

  1. Admin adds a repo secret (run in this repo). A repo admin opens Settings → Secrets and variables → Actions → New repository secret, names it ANTHROPIC_API_KEY_<YOUR_HANDLE>, and pastes your key. Then launch from the Actions tab as normal.
  2. Run in your own fork (no admin needed). Fork the repo, add ANTHROPIC_API_KEY_<YOUR_HANDLE> under your fork's Settings → Secrets and variables → Actions, and run Benchmark run there. Open a PR back to publish results.

Keys are never printed in logs. Get a key at console.anthropic.com → API Keys. This key path is only for the GitHub Actions / OpenCode adapter paid runs above. The local Claude Code CLI runner (step 2) is the opposite: it strips ANTHROPIC_API_KEY/ANTHROPIC_AUTH_TOKEN from every invocation and runs on your Claude Code subscription — no API key, no API billing.

Ship it

Pull requests & merge

Normal PR

  1. git push -u origin eval/add-locomo-loader
  2. gh pr create — fill the template (scope, contract impact).
  3. CI runs the test check and the directory's code owner is auto-requested — both as signals, neither blocks the merge.
  4. Review is encouraged but not required; every PR is eligible to merge by default.
  5. Squash-merge whenever you're ready — any collaborator can merge; branch auto-deletes.

Changing the frozen contract

  • Title the PR [CONTRACT] ….
  • Edit schema.py/protocols.py and architecture.md together.
  • Fill the "affected dependents" table.
  • By team convention, loop in all four owners (CODEOWNERS auto-requests them) — a courtesy for contract changes, not a hard gate.
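
The title convention above lends itself to an advisory check. This is a sketch under stated assumptions: it matches frozen files by basename only, and the function names are illustrative rather than anything the repo's CI is known to run.

```python
from pathlib import PurePosixPath
from typing import Iterable

FROZEN_FILES = {"schema.py", "protocols.py"}  # the frozen contract files


def touches_contract(paths: Iterable[str]) -> bool:
    """True if any changed path is one of the frozen contract files (by basename)."""
    return any(PurePosixPath(p).name in FROZEN_FILES for p in paths)


def title_follows_convention(title: str, paths: Iterable[str]) -> bool:
    """A PR that edits the contract should be titled '[CONTRACT] ...'."""
    return title.startswith("[CONTRACT]") or not touches_contract(paths)


changed = ["eval/memeval/schema.py", "architecture.md"]
print(title_follows_convention("Add field to schema", changed))             # False
print(title_follows_convention("[CONTRACT] Add field to schema", changed))  # True
```

Because contract review is a courtesy rather than a hard gate, a failed check here would be a reminder to retitle the PR, not a blocker.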

Branch protection is intentionally non-blocking: a PR is the workflow, but no approvals or passing checks are required to merge — every PR is eligible by default and any collaborator can merge. CI + code-owner requests still run as signals.

Status: all three collaborators — Keith @kmazanec, Brent @bgibson1618, Scott B. @NerdAlert58 — have accepted and hold write access, so each can merge PRs. Branch protection is non-blocking by design: no required reviews or status checks, so every PR is mergeable by default. CODEOWNERS still auto-requests the right reviewer as a courtesy.

Optional

Langfuse tracing for the agent loop

Verdict: optional, lazy add-on — keep the Trajectory JSONL + metrics as the source of truth; use Langfuse as a mirror to inspect multi-step agent runs and the dreaming loop. It's a no-op unless installed and keyed, so the offline/CI path is untouched.

Turn it on

pip install langfuse
export LANGFUSE_PUBLIC_KEY=pk-...
export LANGFUSE_SECRET_KEY=sk-...
# then run_agent emits a trace per task automatically

memeval/tracing.py wires it in: one span per task, nested retrieve / generate / write steps, the four metrics as scores.
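
The "no-op unless installed and keyed" behaviour can be pictured as a guard like the one below. It is a sketch, not the contents of memeval/tracing.py: the function name tracing_enabled is hypothetical, and it only checks that the langfuse package is importable and that both environment variables from the snippet above are set.

```python
import importlib.util
import os
from typing import Mapping, Optional


def tracing_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """True only when both Langfuse keys are set and the package is installed."""
    env = os.environ if env is None else env
    keyed = bool(env.get("LANGFUSE_PUBLIC_KEY")) and bool(env.get("LANGFUSE_SECRET_KEY"))
    if not keyed:
        return False  # offline/CI path: skip tracing without importing anything
    return importlib.util.find_spec("langfuse") is not None


print(tracing_enabled({}))  # False
```

Checking the keys first means the offline path never touches the import machinery for langfuse at all.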

Why bother

  • Nested trace tree per task (vs. a flat JSONL dump).
  • Per-step latency + retrieved items inline — find the bad retrieval.
  • Experiment diff: memory-on vs off, Haiku vs Opus, side by side.
  • Debug the dreaming loop's merge/dedup decisions step by step.