Overview / Results

Results & metrics

Every benchmark run is logged to the aggregate results.json ledger (rendered here) and saved as a per-benchmark, versioned file: results/{vX.Y}/{bench-name}-{timestamp}.json. Run → ledger + versioned file → commit → this page shows it.

🎯

Suite focus: two in-scope benchmarks — SWE-Bench-CL (primary, continual learning) and VISTA (2nd: foresight × safety, memory poisoning / adaptation) — each shown with its own native metrics. The four legacy memory benches stay available below.

No runs logged yet

Run one and it appears here automatically: python -m memeval.claudecode.run_bench --benchmark longmemeval --mode all --model claude-haiku-4-5 --results ../results.json (or the offline echo path python -m memeval.results run --benchmark longmemeval --model echo --no-memory --path tests/fixtures/longmemeval.json --results ../results.json), then commit results.json.

How to run the benchmarks (built-in vs our plugin memory, via the Claude Code CLI, subscription auth — no API key): python -m memeval.claudecode.run_bench --benchmark all --mode all --model claude-haiku-4-5 --results ../results.json. This appends to the aggregate ledger and writes per-benchmark versioned files under results/v0.1/. The offline echo path still works too: python -m memeval.results run --benchmark <name> --model echo --memory --results ../results.json. Raw ledger: results.json.