Extended benchmark harness targeting MCP coding agents. It wraps the existing
`benchmarks/fair/` exact-symbol bench under a shared framework and provides
space for four additional sub-benches (see
`docs/superpowers/specs/2026-04-21-memtrace-benchmark-suite-design.md`).

- Bench #0 — Exact-symbol lookup (wraps `fair/dataset.json`) — Memtrace wins with 96.7% Acc@1
- Bench #1 — Token economy (SNR + MRR) — Memtrace wins with 495.52 vs 126.90 Acc@1/k-tok (3.9× lead over GitNexus, 15× over ChromaDB)
- Bench #3 — Graph queries (pyright ground truth) — Memtrace wins callers (0.851), callees (0.429), and impact (0.874); CGC is competitive on callers/callees after the extractor fix (see the post-fix re-run below)
- Bench #4 — Incremental / staleness — Memtrace wins time_to_queryable_p95 = 42.5 ms (14× faster than CGC, 118× faster than ChromaDB)
- Bench #5 — Agent-level SWE completion — gated behind `RUN_AGENT_BENCH=1`; skeleton + 5-task dataset only
- Adapters: Memtrace, ChromaDB, GitNexus, CGC

Bench #2 (Django intent retrieval) lives on the `bench-2-django-intent` branch — see that branch's results. The first run had an indexing-setup issue (only ChromaDB had Django fully indexed); a reliable re-run is TBD.
```bash
# 1. Build Memtrace
cargo build --release

# 2. Install Python deps into benchmarks/.venv
./benchmarks/suite/scripts/bootstrap.sh

# 3. Point the suite at your checkouts (used by all benches)
export MEMPALACE_PATH=/path/to/mempalace
export DJANGO_PATH=/path/to/django   # only needed for Bench #2 + Bench #3 Django

# 4. Start the GitNexus eval-server (optional; skipped if not running)
cd "$MEMPALACE_PATH" && npx gitnexus eval-server &

# 5. Run the bench
benchmarks/.venv/bin/python -m benchmarks.suite run \
  --bench 0 --adapters memtrace,chromadb,gitnexus,cgc
```

Output: `benchmarks/suite/results/bench_0/rollup.md` (markdown table + primary-axis winner), `rollup.csv` (same data, flat), and per-adapter jsonl.
Memtrace prerequisite: `memtrace index /path/to/mempalace` must have
populated the ArcadeDB graph before Bench #0 is run. Without an indexed
corpus, Memtrace's `find_symbol` returns empty results and the bench
reports 0% accuracy. The `fair/` harness has the same dependency; it's
a deployment prerequisite, not a bench bug.
1,000-query parity check: the suite's Bench #0 vs the original `fair/results.json`,
on the same dataset, the same four adapters, and the same mempalace commit. Tolerance:
±0.5% absolute on Acc@1 (±1% for ChromaDB due to embedding non-determinism).
| Adapter | Suite Acc@1 | fair Acc@1 | Δ | Suite Acc@5 | fair Acc@5 | Suite latency | fair latency |
|---|---|---|---|---|---|---|---|
| memtrace | 96.7% | 96.7% | 0.0 | 99.9% | 100.0% | 7.54 ms | 9.16 ms |
| chromadb | 62.2% | 62.3% | -0.1 | 85.9% | 86.1% | 57.13 ms | 58.46 ms |
| gitnexus | 27.1% | 27.1% | 0.0 | 89.7% | 89.7% | 167.73 ms | 191.21 ms |
| cgc | 6.4% | 6.4% | 0.0 | 66.0% | 66.4% | 1452.78 ms | 1627.17 ms |
All four adapters match within tolerance. Per-adapter raw jsonl and the
rollup markdown/CSV are in benchmarks/suite/results/bench_0_full/
(gitignored — regenerate with the repro commands above).
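As a sanity check, a minimal sketch of the tolerance comparison, with the Acc@1 values copied from the table above (in practice they would be parsed from the suite rollup CSV and `fair/results.json` rather than hard-coded):

```python
# Hedged sketch: parity tolerance check on Acc@1, values copied from the table above.
TOLERANCE_PCT = {"memtrace": 0.5, "chromadb": 1.0, "gitnexus": 0.5, "cgc": 0.5}  # absolute %, per the text above

suite_acc1 = {"memtrace": 96.7, "chromadb": 62.2, "gitnexus": 27.1, "cgc": 6.4}
fair_acc1  = {"memtrace": 96.7, "chromadb": 62.3, "gitnexus": 27.1, "cgc": 6.4}

for adapter, tol in TOLERANCE_PCT.items():
    delta = suite_acc1[adapter] - fair_acc1[adapter]
    verdict = "within tolerance" if abs(delta) <= tol else "DRIFT"
    print(f"{adapter:9s} Δ={delta:+.1f}%  (±{tol}%)  {verdict}")
```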
Same 1,000-query dataset as Bench #0, but re-scored through the lens that matters most to MCP coding agents: how many correct answers do you get per 1,000 response tokens? Every tool response eats context window, so an adapter that's accurate at low token cost wins.
Primary axis: `acc_at_1_per_kilo_token` = Acc@1 ÷ (avg_tokens ÷ 1000)
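For concreteness, a minimal sketch of the primary axis. Acc@1 goes in as a percentage, so the Memtrace row in the table below works out to 96.7 ÷ 0.195 ≈ 496; the published 495.52 differs slightly only because avg tokens is rounded to 195 here:

```python
def acc_at_1_per_kilo_token(acc_at_1_pct: float, avg_tokens: float) -> float:
    """Correct top-1 answers per 1,000 response tokens — higher is better."""
    return acc_at_1_pct / (avg_tokens / 1000.0)

# Memtrace row from the table below: 96.7% Acc@1 at ~195 tokens per response.
print(round(acc_at_1_per_kilo_token(96.7, 195), 1))   # 495.9 (published: 495.52)
```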
| Adapter | Acc@1 | MRR | Avg tokens | Tokens/hit | Acc@1/k-tok | SNR | Avg latency |
|---|---|---|---|---|---|---|---|
| memtrace | 96.7% | 0.980 | 195 | 202 | 495.52 | 86.7% | 7.54 ms |
| gitnexus | 27.1% | 0.513 | 214 | 788 | 126.90 | 19.2% | 167.73 ms |
| chromadb | 62.2% | 0.723 | 1,937 | 3,114 | 32.11 | 62.6% | 57.13 ms |
| cgc | 6.4% | 0.357 | 221 | 3,452 | 28.97 | 9.4% | 1,452.78 ms |
Memtrace wins acc_at_1_per_kilo_token — 495.52 vs GitNexus at 126.90
(3.9× lead over next competitor; 15.4× over ChromaDB, the vector-DB
archetype). The tokens/hit column tells the story starkly: to surface
one correct top-1 answer, ChromaDB spends 3,114 tokens, CGC spends 3,452,
GitNexus 788 — Memtrace does it in 202.
SNR = fraction of response tokens that were part of a correct top-1 hit. Memtrace's 86.7% means almost every token the tool returns is signal; GitNexus's 19.2% means roughly four out of five response tokens are flow-description noise around the right answer.
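A minimal sketch of that SNR computation; the record fields and the exact token accounting (hit-span tokens vs whole-response tokens) are illustrative assumptions, not the suite's actual jsonl schema:

```python
def signal_to_noise(records: list[dict]) -> float:
    """Fraction of all response tokens that belong to a correct top-1 hit."""
    signal = sum(r["hit_tokens"] for r in records if r["top1_correct"])   # assumed field names
    total = sum(r["response_tokens"] for r in records)
    return signal / total if total else 0.0

# Illustrative records: two queries, one correct top-1 hit.
records = [
    {"response_tokens": 200, "hit_tokens": 170, "top1_correct": True},
    {"response_tokens": 220, "hit_tokens": 0,   "top1_correct": False},
]
print(f"SNR = {signal_to_noise(records):.1%}")   # SNR = 40.5%
```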
Re-generate from the committed Bench #0 jsonl:

```bash
benchmarks/.venv/bin/python -c "
from pathlib import Path
from benchmarks.suite.benches.bench_1_snr_mrr.run import run_from_bench_0_jsonl
combined = next(Path('benchmarks/suite/results/bench_0_full').glob('combined-*.jsonl'))
run_from_bench_0_jsonl(combined, Path('benchmarks/suite/results/bench_1_token_economy'))
"
cat benchmarks/suite/results/bench_1_token_economy/rollup.md
```

`fair/` is preserved unchanged so the published numbers remain reproducible with the exact script that generated them. The suite is the forward-going framework: same dataset, new contract, new reporting format, ready for Bench #1–#5.
| Path | Role |
|---|---|
| `contract.py` | Adapter base class + dataclasses (`QueryResult`, `NotSupported`, …) |
| `scoring.py` | Pure-function metrics (`acc_at_k`, `mrr`, `signal_to_noise`, …) |
| `runner.py` | Bench-agnostic runner; guarantees setup/teardown + per-query error capture |
| `reporting.py` | jsonl → markdown + CSV rollup |
| `adapters/` | One file per competitor |
| `benches/bench_N_*/run.py` | Per-bench `PRIMARY_AXIS` + `run_with_adapter` |
| `versions.toml` | Pinned versions — single source of truth |
| `tests/` | Unit tests (fast) + integration tests (marker-gated) |
See CONTRIBUTING_ADAPTER.md (scaffold in follow-up plan).
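Until that scaffold lands, here is a hedged sketch of the shape an adapter can take against `contract.py` — class names, fields, and signatures below are illustrative guesses, not the suite's actual contract:

```python
# Hedged sketch of a suite adapter; the real base class and dataclasses live in
# benchmarks/suite/contract.py and may differ in names and signatures.
from dataclasses import dataclass, field

@dataclass
class QueryResult:                                       # illustrative shape only
    answers: list[str] = field(default_factory=list)     # ranked symbol names / locations
    response_tokens: int = 0                             # context-window cost of the response
    latency_ms: float = 0.0

class NotSupported(Exception):
    """Raised when an adapter cannot answer a query class (e.g. graph queries on a vector DB)."""

class MyAdapter:
    name = "my_adapter"

    def setup(self, corpus_path: str) -> None:
        """Index the corpus; called once per bench by the runner."""

    def query(self, question: str) -> QueryResult:
        """Answer one bench query, or raise NotSupported."""
        raise NotSupported(f"{self.name} does not implement this query class")

    def teardown(self) -> None:
        """Release resources; the runner guarantees this runs even after errors."""
```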
Each sub-bench declares one primary metric. Memtrace must win it. A 10% tolerance applies to secondary metrics. See the design doc for details.
| Trigger | Scope | Budget |
|---|---|---|
| PR | Unit tests + smoke (MAX_QUERIES=20, Memtrace only) | ≤ 3 min |
| Nightly | Full mode, Tier 1 + Tier 3 adapters | ≤ 45 min |
| Release cut | All adapters, all corpora, + Bench #5 | ~3 h + 1 h |
200-symbol pyright call-hierarchy ground truth, 4 adapters, 2026-04-22. Re-run after two fixes:

- `analyze_relationships` Cypher anchor in `relationships.rs:108-110` (took Memtrace callers from 0 → 0.851)
- Proper GitNexus flow-text + CGC table extractors (the previous extractors only pulled file paths; the new ones pull caller SYMBOL NAMES, which is what name-based scoring needs). Previously CGC scored 0.000 on all axes because of this parser bug — not because it lacked graph data.
| Adapter | Graph support | Callers recall | Callees recall | Impact recall | Avg latency |
|---|---|---|---|---|---|
| memtrace 🏆 | ✅ | 0.851 | 0.429 | 0.874 | 243 ms |
| cgc | ✅ | 0.584 | 0.326 | — (not impl.) | 463 ms |
| gitnexus | ✅ | 0.013 | 0.027 | 0.007 | 195 ms |
| chromadb | ❌ NotSupported | N/A | N/A | N/A | — |
Recall averaged only over symbols whose pyright ground truth has non-empty
gold on that axis (70 of 200 for callers/impact, 193 of 200 for callees).
Unfiltered numbers in results/bench_3_full/rollup.md.
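A minimal sketch of that filtered averaging for a single axis; the per-row gold/predicted name sets use illustrative field names, not the suite's jsonl schema:

```python
def filtered_recall(rows: list[dict]) -> float:
    """Mean recall over symbols whose pyright gold set for this axis is non-empty."""
    per_symbol = []
    for row in rows:
        gold = set(row["gold_names"])               # symbol NAMES from the pyright ground truth
        if not gold:                                # empty-gold rows are dropped in the filtered view
            continue
        predicted = set(row["predicted_names"])     # names extracted from the adapter response
        per_symbol.append(len(gold & predicted) / len(gold))
    return sum(per_symbol) / len(per_symbol) if per_symbol else 0.0

rows = [
    {"gold_names": ["enqueue_job", "retry_task"], "predicted_names": ["enqueue_job"]},
    {"gold_names": [], "predicted_names": ["noise"]},   # excluded from the filtered view
]
print(filtered_recall(rows))   # 0.5
```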
| Axis | Earlier reported | Fair re-run | Honest delta |
|---|---|---|---|
| Memtrace callers | 0.851 | 0.851 | unchanged |
| CGC callers | 0.000 | 0.584 | CGC is actually competitive — earlier number was a parser bug |
| GitNexus callers | 0.000 | 0.013 | GitNexus's flow model genuinely doesn't map to "which symbols call X" by name |
Memtrace still wins cleanly — 1.46× lead over CGC on callers, 1.32× on callees, and CGC has no impact implementation so Memtrace's 0.874 on transitive-callers stands alone. But the margin is smaller than the earlier misleading "vs 0.000" result, and CGC deserves credit for serious graph capability on callers/callees.
ChromaDB remains correctly excluded — vector DBs cannot answer graph
queries by construction. The NotSupported column IS the Bench #3 story
for vector databases.
Initial adapters pulled only file paths from GitNexus/CGC free-text
output. Bench #3 scores by matching symbol NAMES (since adapter path
formats disagreed), so path-only extraction scored 0.000 everywhere,
producing a fake "Memtrace sweeps vs 0.000" framing. A user caught it:
"GitNexus is a graph solution — why is it 0.000?" It wasn't. Our
extractor was broken. Fixed in commits <placeholder>: rewrote the
GitNexus flow-text parser to pull `Symbol <name> → <file>` pairs
correctly, and rewrote the CGC table parser to use the first column
(Caller Function) with `COLUMNS=400` in the environment to defeat Rich truncation.
Lesson: benchmark scoring methodology is as load-bearing as the query itself. Any name-based scorer needs adapters that actually pull names — file paths are a different dimension.
Full details: results/bench_3_full_post_fix/rollup.md.
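To illustrate the class of fix (not the committed parser itself), a hedged sketch of a name-based extractor over flow-style text; the line format and regex below are assumptions about the adapter output, not the actual GitNexus format:

```python
import re

# Assumed flow-text shape: lines like "Symbol process_task → src/queue/worker.py".
# The point is to pull symbol NAMES, not file paths, since Bench #3 scores by name.
FLOW_LINE = re.compile(r"Symbol\s+(?P<name>[A-Za-z_][\w.]*)\s+→\s+(?P<path>\S+)")

def extract_symbol_names(flow_text: str) -> list[str]:
    """Return caller symbol names in order of appearance, de-duplicated."""
    names: list[str] = []
    for match in FLOW_LINE.finditer(flow_text):
        name = match.group("name")
        if name not in names:
            names.append(name)
    return names

print(extract_symbol_names(
    "Symbol enqueue_job → src/queue/api.py\nSymbol process_task → src/queue/worker.py"
))  # ['enqueue_job', 'process_task']
```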
To confirm the graph-query win isn't mempalace-specific, Bench #3 was re-run on Django (~3,300 files, 50,955 nodes, 138k edges) with a smaller pyright ground truth (31 triples; pyright's call-hierarchy resolution timed out on many Django symbols due to corpus size, so fewer than the 200-triple target were produced).
Filtered (only symbols with non-empty gold on the axis):
| Adapter | Graph support | Callers recall (N=19) | Callees recall (N=25) | Impact recall (N=19) | Avg latency |
|---|---|---|---|---|---|
| memtrace | ✅ | 0.816 | 0.169 | 0.751 | 2,104 ms |
| chromadb | ❌ NotSupported | N/A | N/A | N/A | — |
| gitnexus | ✅ (regex) | 0.053 | 0.000 | 0.053 | 378 ms |
| cgc | ✅ | 0.000 | 0.000 | 0.000 | 455 ms |
Unfiltered (averaged over all 31 queries, including empty-gold rows — what
lives in results/bench_3_full_django/rollup.md):
| Adapter | Callers recall | Callees recall | Impact recall |
|---|---|---|---|
| memtrace | 0.500 | 0.136 | 0.460 |
| gitnexus | 0.032 | 0.000 | 0.032 |
| cgc | 0.000 | 0.000 | 0.000 |
The published numbers (and the SVG) use the filtered view for parity with
the mempalace bench_3_full_post_fix rollup, which filters the same way. A
15.4× lead on callers and 14.2× on impact are the numbers to cite.
Memtrace wins again — it is the only adapter returning real graph data on Django. Absolute numbers are lower than on mempalace (0.816 vs 0.851 on callers) because:

- Django has ~50k symbols with many naming collisions (`get`, `save`, `Meta`, `clean`) — Memtrace's name-keyed graph queries can't disambiguate without file/scope context.
- Latency climbs 13× (166 ms → 2.2 s) because the graph is 13× larger and variable-length traversals scan more edges.
Generalization verified: graph queries work cross-corpus, and the
structural-win story holds — ChromaDB fundamentally can't compete,
GitNexus's regex extractor still misses on name-matching. Full details:
results/bench_3_full_django/rollup.md.
50 deterministic edits (add/rename/move/delete) applied to a hand-authored
21-file scratch_fixture corpus (task-queue domain). For each edit we
measure wall-time from file-change to first successful re-query
(time_to_queryable_p95 = primary axis, lower is better), plus a
staleness probe that asks the adapter for the OLD symbol name after
the edit and counts the cases where the pre-edit state still resolves.
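A minimal sketch of the per-edit measurement loop; the three callables are hypothetical stand-ins for the suite's per-adapter hooks, and the 5-second deadline mirrors the ChromaDB timeout visible in the table below:

```python
import time
from typing import Callable, Optional

DEADLINE_S = 5.0    # per-edit deadline (matches the ChromaDB timeout in the table below)
POLL_S = 0.01

def time_to_queryable(apply_edit: Callable[[], None],
                      trigger_reindex: Callable[[], None],
                      query_ok: Callable[[], bool]) -> Optional[float]:
    """Seconds from file change to first successful re-query, or None if the deadline passes.

    apply_edit writes the add/rename/move/delete to disk, trigger_reindex kicks the
    adapter's incremental re-index, and query_ok asks whether the edited symbol resolves.
    """
    apply_edit()
    start = time.monotonic()
    trigger_reindex()
    while time.monotonic() - start < DEADLINE_S:
        if query_ok():                          # first successful re-query stops the clock
            return time.monotonic() - start
        time.sleep(POLL_S)
    return None                                 # counts against the "Queryable" column
```

The p50/p95 columns below aggregate these per-edit times across the 50 edits; the staleness probe reuses the same query hook with the OLD symbol name and counts a successful resolve as stale.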
| Adapter | Reindex supported | Queryable | t_queryable p95 | p50 | Reindex p50 | Staleness | Errors |
|---|---|---|---|---|---|---|---|
| memtrace | ✅ | 97.8% | 42.5 ms | 20.7 ms | 1168 ms | 0.400 | 0 |
| cgc | ✅ | 100.0% | 613.7 ms | 519.8 ms | 1127 ms | 0.340 | 0 |
| chromadb | ✅ | 76.1% | 5001 ms (timeout) | 55.7 ms | 66.8 ms | 0.680 | 0 |
| gitnexus | ❌ NotSupported | N/A | N/A | N/A | — | — | 0 |
Memtrace wins time_to_queryable_p95 — 42.5 ms vs CGC 613.7 ms
(14.4× faster) and vs ChromaDB's 5001 ms deadline (>118× faster
at p95). Memtrace's index_directory(incremental=true, skip_embed=true)
re-parses only the changed files and their dependents; queries return
within ~20 ms once the incremental job completes.
Honest-loss columns:

- ChromaDB: re-embeds the chunks of touched files on re-index, but the symbol name rarely appears in the top-k vector-similar chunks (24% of edits never become queryable before the 5-second deadline). Staleness is 68% — the pre-edit chunks remain retrievable because renames inside a chunk don't invalidate neighbouring chunks.
- GitNexus: the eval-server is batch-only, with no per-file reindex API. It returns `NotSupported` across all 50 edits, which is the correct answer — it cannot compete on incremental freshness.
- CGC: `cgc index <file>` works per-file, but each invocation reboots the FalkorDB pipeline (~1.1 s reindex p50). Its t_queryable p95 is a solid 614 ms — second place.
Staleness caveat for Memtrace (40%): renames using
`index_directory(incremental=true, clear_existing=false)` leave the
pre-rename `CodeNode` in the graph — `find_symbol(old_name)` still returns
the stale location until a full re-index drops the orphans. This is the
honest trade for Memtrace's sub-50 ms freshness: incremental parse +
merge doesn't invalidate the prior AST hash. An orphan-sweep pass on
incremental reindex is the obvious next optimisation.
Reproduce:

```bash
benchmarks/.venv/bin/python -m benchmarks.suite.benches.bench_4_incremental.driver
cat benchmarks/suite/results/bench_4_full/rollup.md
```

Full per-edit jsonl + rollup: results/bench_4_full/.