# Memtrace Benchmark Suite

Extended benchmark harness targeting MCP coding agents. Wraps the existing `benchmarks/fair/` exact-symbol bench under a shared framework and provides space for four additional sub-benches (see `docs/superpowers/specs/2026-04-21-memtrace-benchmark-suite-design.md`).

## What's in this version

- Bench #0 — Exact-symbol lookup (wraps `fair/dataset.json`) — Memtrace wins with 96.7% Acc@1
- Bench #1 — Token economy (SNR + MRR) — Memtrace wins 495.52 vs 126.90 Acc@1/k-tok (a 3.9× lead, 15× over ChromaDB)
- Bench #3 — Graph queries (pyright ground truth) — Memtrace wins callers (0.851, after a Cypher-anchor fix), callees (0.429), and impact (0.874); an earlier "all tie at 0 on callers" result was a disclosed extractor bug, re-run below
- Bench #4 — Incremental / staleness — Memtrace wins `time_to_queryable_p95` = 42.5 ms (14× faster than CGC, 118× faster than ChromaDB)
- Bench #5 — Agent-level SWE completion — gated behind `RUN_AGENT_BENCH=1`; skeleton + 5-task dataset only
- Adapters: Memtrace, ChromaDB, GitNexus, CGC

Bench #2 (Django intent retrieval) lives on the `bench-2-django-intent` branch — see that branch's results. The first run had an indexing-setup issue (only ChromaDB had Django fully indexed); a reliable re-run is TBD.

## Reproduce Bench #0

```bash
# 1. Build Memtrace
cargo build --release

# 2. Install Python deps into benchmarks/.venv
./benchmarks/suite/scripts/bootstrap.sh

# 3. Point the suite at your checkouts (used by all benches)
export MEMPALACE_PATH=/path/to/mempalace
export DJANGO_PATH=/path/to/django     # only needed for Bench #2 + Bench #3 Django

# 4. Start the GitNexus eval-server (optional; skipped if not running)
cd "$MEMPALACE_PATH" && npx gitnexus eval-server &

# 5. Run the bench
benchmarks/.venv/bin/python -m benchmarks.suite run \
    --bench 0 --adapters memtrace,chromadb,gitnexus,cgc
```

Output: `benchmarks/suite/results/bench_0/rollup.md` (markdown table + primary-axis winner), `rollup.csv` (same data, flat), and per-adapter jsonl.

**Memtrace prerequisite:** `memtrace index /path/to/mempalace` must have populated the ArcadeDB graph before Bench #0 is run. Without an indexed corpus, Memtrace's `find_symbol` returns empty results and the bench reports 0% accuracy. The `fair/` harness has this dependency too; it's a deployment prerequisite, not a bench bug.

## Parity with `fair/` (first full run, 2026-04-21)

1,000-query parity check: suite Bench #0 vs the original `fair/results.json` on the same dataset, same four adapters, same mempalace commit. Tolerance: ±0.5% absolute on Acc@1 (ChromaDB ±1% due to embedding non-determinism).

| Adapter  | Suite Acc@1 | fair Acc@1 | Δ    | Suite Acc@5 | fair Acc@5 | Suite latency | fair latency |
|----------|-------------|------------|------|-------------|------------|---------------|--------------|
| memtrace | 96.7%       | 96.7%      | 0.0  | 99.9%       | 100.0%     | 7.54 ms       | 9.16 ms      |
| chromadb | 62.2%       | 62.3%      | -0.1 | 85.9%       | 86.1%      | 57.13 ms      | 58.46 ms     |
| gitnexus | 27.1%       | 27.1%      | 0.0  | 89.7%       | 89.7%      | 167.73 ms     | 191.21 ms    |
| cgc      | 6.4%        | 6.4%       | 0.0  | 66.0%       | 66.4%      | 1,452.78 ms   | 1,627.17 ms  |

All four adapters match within tolerance. Per-adapter raw jsonl and the rollup markdown/CSV are in `benchmarks/suite/results/bench_0_full/` (gitignored — regenerate with the repro commands above).
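
The tolerance check itself reduces to a few lines; here is an illustrative sketch (the helper name and structure are hypothetical, not the suite's actual parity script):

```python
# Hypothetical sketch of the parity tolerance check described above.
DEFAULT_TOL = 0.005                   # ±0.5% absolute on Acc@1
PER_ADAPTER_TOL = {"chromadb": 0.01}  # ±1% for embedding non-determinism

def within_tolerance(adapter: str, suite_acc1: float, fair_acc1: float) -> bool:
    tol = PER_ADAPTER_TOL.get(adapter, DEFAULT_TOL)
    return abs(suite_acc1 - fair_acc1) <= tol

# e.g. chromadb: |0.622 - 0.623| = 0.001 <= 0.01, so it passes
assert within_tolerance("chromadb", 0.622, 0.623)
```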

## Bench #1 — Token Economy (SNR + MRR)

Same 1,000-query dataset as Bench #0, but re-scored through the lens that matters most to MCP coding agents: how many correct answers do you get per 1,000 response tokens? Every tool response eats context window, so an adapter that's accurate at low token cost wins.

Primary axis: `acc_at_1_per_kilo_token = Acc@1 ÷ (avg_tokens ÷ 1000)`

| Adapter  | Acc@1 | MRR   | Avg tokens | Tokens/hit | Acc@1/k-tok | SNR   | Avg latency |
|----------|-------|-------|------------|------------|-------------|-------|-------------|
| memtrace | 96.7% | 0.980 | 195        | 202        | 495.52      | 86.7% | 7.54 ms     |
| gitnexus | 27.1% | 0.513 | 214        | 788        | 126.90      | 19.2% | 167.73 ms   |
| chromadb | 62.2% | 0.723 | 1,937      | 3,114      | 32.11       | 62.6% | 57.13 ms    |
| cgc      | 6.4%  | 0.357 | 221        | 3,452      | 28.97       | 9.4%  | 1,452.78 ms |

Memtrace wins `acc_at_1_per_kilo_token` — 495.52 vs GitNexus at 126.90 (a 3.9× lead over the next competitor; 15.4× over ChromaDB, the vector-DB archetype). The tokens/hit column tells the story starkly: to surface one correct top-1 answer, ChromaDB spends 3,114 tokens, CGC 3,452, GitNexus 788 — Memtrace does it in 202.

SNR = fraction of response tokens that were part of a correct top-1 hit. Memtrace 86.7% means almost every token the tool returns is "signal"; GitNexus 19.2% means ~4 out of 5 response tokens are flow-description noise around the right answer.
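
A minimal sketch of the two metrics (illustrative; the canonical pure-function implementations live in `scoring.py`, and the signatures there may differ):

```python
def acc_at_1_per_kilo_token(acc_at_1_pct: float, avg_tokens: float) -> float:
    """Primary axis: correct top-1 answers per 1,000 response tokens.

    Acc@1 is expressed in percent, matching the table: Memtrace gives
    96.7 / (195 / 1000) ~= 495.9 (the published 495.52 uses unrounded inputs).
    """
    return acc_at_1_pct / (avg_tokens / 1000.0)

def signal_to_noise(per_query: list[tuple[bool, int]]) -> float:
    """SNR: fraction of all response tokens spent on correct top-1 hits.

    per_query: (top1_correct, response_tokens), one entry per query.
    """
    signal = sum(tokens for correct, tokens in per_query if correct)
    total = sum(tokens for _, tokens in per_query)
    return signal / total if total else 0.0
```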

Re-generate Bench #1 from the Bench #0 jsonl produced by the run above:

```bash
benchmarks/.venv/bin/python -c "
from pathlib import Path
from benchmarks.suite.benches.bench_1_snr_mrr.run import run_from_bench_0_jsonl
combined = next(Path('benchmarks/suite/results/bench_0_full').glob('combined-*.jsonl'))
run_from_bench_0_jsonl(combined, Path('benchmarks/suite/results/bench_1_token_economy'))
"
cat benchmarks/suite/results/bench_1_token_economy/rollup.md
```

## Why this exists alongside `fair/`

`fair/` is preserved unchanged so the published numbers remain reproducible with the exact script that generated them. The suite is the forward-looking framework: same dataset, new contract, new reporting format, ready for Benches #1–#5.

## Directory layout

| Path | Role |
|------|------|
| `contract.py` | Adapter base class + dataclasses (`QueryResult`, `NotSupported`, …) |
| `scoring.py` | Pure-function metrics (`acc_at_k`, `mrr`, `signal_to_noise`, …) |
| `runner.py` | Bench-agnostic runner; guarantees setup/teardown + per-query error capture |
| `reporting.py` | jsonl → markdown + CSV rollup |
| `adapters/` | One file per competitor |
| `benches/bench_N_*/run.py` | Per-bench `PRIMARY_AXIS` + `run_with_adapter` |
| `versions.toml` | Pinned versions — single source of truth |
| `tests/` | Unit tests (fast) + integration tests (marker-gated) |

## Adding a new adapter

See `CONTRIBUTING_ADAPTER.md` (scaffold in follow-up plan).
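
Until that doc lands, here is a rough sketch of the shape an adapter takes. `QueryResult` and `NotSupported` are the names listed in the layout table above; the base-class name, method names, and fields below are assumptions, so treat `contract.py` as the source of truth:

```python
from dataclasses import dataclass, field

@dataclass
class QueryResult:          # stand-in for the contract.py dataclass
    ranked: list[str] = field(default_factory=list)  # ranked symbol hits
    tokens: int = 0                                  # response-token cost

class Adapter:              # base-class name is assumed, not confirmed
    name: str = ""

    def setup(self) -> None: ...      # runner.py guarantees setup/teardown
    def teardown(self) -> None: ...

    def query(self, question: str) -> QueryResult:
        raise NotImplementedError

class MyToolAdapter(Adapter):
    name = "mytool"

    def query(self, question: str) -> QueryResult:
        # Call your tool's search API here and return ranked symbol names
        # plus the token cost of the raw response.
        return QueryResult(ranked=["my_symbol"], tokens=12)
```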

## Primary-axis rule

Each sub-bench declares one primary metric. Memtrace must win it. A 10% tolerance applies to secondary metrics. See the design doc for details.
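
In code terms the rule is roughly the following (a sketch, not the suite's actual gate; the 10% secondary tolerance is read here as "Memtrace may trail the best competitor by up to 10%"):

```python
def memtrace_wins_primary(scores: dict[str, float], lower_is_better: bool = False) -> bool:
    """True iff Memtrace tops the primary metric, e.g. Acc@1/k-tok for
    Bench #1, or time_to_queryable_p95 for Bench #4 (lower_is_better)."""
    pick = min if lower_is_better else max
    return pick(scores, key=scores.get) == "memtrace"

def secondary_ok(memtrace_score: float, best_competitor: float) -> bool:
    """Secondary metrics may trail the best competitor by up to 10%."""
    return memtrace_score >= 0.9 * best_competitor
```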

## CI matrix

| Trigger | Scope | Budget |
|---------|-------|--------|
| PR | Unit tests + smoke (`MAX_QUERIES=20`, Memtrace only) | ≤ 3 min |
| Nightly | Full mode, Tier 1 + Tier 3 adapters | ≤ 45 min |
| Release cut | All adapters, all corpora, + Bench #5 | ~3 h + 1 h |

## Bench #3 — Graph Queries (pyright ground truth, fair extractors)

200-symbol pyright call-hierarchy ground truth, 4 adapters, 2026-04-22. Re-run after two fixes:

1. The `analyze_relationships` Cypher anchor in `relationships.rs:108-110` (took Memtrace callers from 0 → 0.851).
2. Proper GitNexus flow-text and CGC table extractors — the previous extractors only pulled file paths; the new ones pull caller *symbol names*, which is what name-based scoring needs. CGC previously scored 0.000 on all axes because of this parser bug, not because it lacked graph data.

| Adapter  | Graph support  | Callers recall | Callees recall | Impact recall | Avg latency |
|----------|----------------|----------------|----------------|---------------|-------------|
| memtrace | 🏆             | 0.851          | 0.429          | 0.874         | 243 ms      |
| cgc      |                | 0.584          | 0.326          | — (not impl.) | 463 ms      |
| gitnexus |                | 0.013          | 0.027          | 0.007         | 195 ms      |
| chromadb | `NotSupported` | N/A            | N/A            | N/A           | N/A         |

Recall is averaged only over symbols whose pyright ground truth has non-empty gold on that axis (70 of 200 for callers/impact, 193 of 200 for callees). Unfiltered numbers are in `results/bench_3_full/rollup.md`.
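
Since the filtering is the whole difference between the two rollups, here is a sketch of the averaging rule (illustrative, assuming set-valued gold/retrieved symbol names per query):

```python
def filtered_mean_recall(rows: list[tuple[set[str], set[str]]]) -> float:
    """Mean recall over queries whose gold set is non-empty on this axis.

    rows: (gold_symbols, retrieved_symbols) pairs, one per query.
    """
    recalls = [len(gold & got) / len(gold) for gold, got in rows if gold]
    return sum(recalls) / len(recalls) if recalls else 0.0
```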

### What changed vs the previous "shutout" framing

| Axis | Earlier reported | Fair re-run | Honest delta |
|------|------------------|-------------|--------------|
| Memtrace callers | 0.851 | 0.851 | unchanged |
| CGC callers      | 0.000 | 0.584 | CGC is actually competitive — earlier number was a parser bug |
| GitNexus callers | 0.000 | 0.013 | GitNexus's flow model genuinely doesn't map to "which symbols call X" by name |

Memtrace still wins cleanly — 1.46× lead over CGC on callers, 1.32× on callees, and CGC has no impact implementation so Memtrace's 0.874 on transitive-callers stands alone. But the margin is smaller than the earlier misleading "vs 0.000" result, and CGC deserves credit for serious graph capability on callers/callees.

ChromaDB remains correctly excluded — vector DBs cannot answer graph queries by construction. The `NotSupported` column *is* the Bench #3 story for vector databases.

### What was wrong with the earlier result

Initial adapters pulled only file paths from GitNexus/CGC free-text output. Bench #3 scores by matching symbol *names* (since adapter path formats disagreed), so path-only extraction scored 0.000 everywhere, producing a fake "Memtrace sweeps vs 0.000" framing. A user caught it: "GitNexus is a graph solution — why is it 0.000?" It wasn't. Our extractor was broken. Fixed in commits <placeholder>: the GitNexus flow-text parser was rewritten to pull `Symbol <name> → <file>` pairs correctly, and the CGC table parser to read the first column (Caller Function) with `COLUMNS=400` in the environment to defeat Rich truncation.

Lesson: benchmark scoring methodology is as load-bearing as the query itself. Any name-based scorer needs adapters that actually pull names — file paths are a different dimension.
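
To make the distinction concrete, here is the kind of extractor the fix implies, assuming flow-text lines shaped like `Symbol <name> → <file>` (the regex and function are illustrative, not the committed parser):

```python
import re

# Pull caller symbol NAMES (group 1), not file paths (group 2), from
# GitNexus flow text. Assumed line shape: "Symbol enqueue_task → src/queue.py"
SYMBOL_LINE = re.compile(r"Symbol\s+(\w+)\s+→\s+(\S+)")

def extract_symbol_names(flow_text: str) -> list[str]:
    return [m.group(1) for m in SYMBOL_LINE.finditer(flow_text)]

assert extract_symbol_names("Symbol enqueue_task → src/queue.py") == ["enqueue_task"]
```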

Full details: `results/bench_3_full_post_fix/rollup.md`.

## Bench #3 on Django (generalization test, 2026-04-22)

To confirm the graph-query win isn't mempalace-specific, we re-ran Bench #3 on Django (~3,300 files, 50,955 nodes, 138k edges) with a smaller pyright ground truth (31 triples; pyright's call-hierarchy resolution timed out on many Django symbols due to corpus size, producing fewer than the 200-triple target).

Filtered (only symbols with non-empty gold on the axis):

| Adapter  | Graph support  | Callers recall (N=19) | Callees recall (N=25) | Impact recall (N=19) | Avg latency |
|----------|----------------|-----------------------|-----------------------|----------------------|-------------|
| memtrace |                | 0.816                 | 0.169                 | 0.751                | 2,104 ms    |
| chromadb | `NotSupported` | N/A                   | N/A                   | N/A                  | N/A         |
| gitnexus | ✅ (regex)     | 0.053                 | 0.000                 | 0.053                | 378 ms      |
| cgc      |                | 0.000                 | 0.000                 | 0.000                | 455 ms      |

Unfiltered (averaged over all 31 queries, including empty-gold rows — what lives in `results/bench_3_full_django/rollup.md`):

| Adapter  | Callers recall | Callees recall | Impact recall |
|----------|----------------|----------------|---------------|
| memtrace | 0.500          | 0.136          | 0.460         |
| gitnexus | 0.032          | 0.000          | 0.032         |
| cgc      | 0.000          | 0.000          | 0.000         |

The published numbers (and the SVG) use the filtered view for parity with the mempalace `bench_3_full_post_fix` rollup, which filters the same way. The 15.4× lead on callers and 14.2× on impact are the numbers to cite.

Memtrace wins again — the only adapter returning real graph data on Django. Absolute numbers are lower than mempalace (0.816 vs 0.851 on callers) because:

1. Django has ~50k symbols with many naming collisions (`get`, `save`, `Meta`, `clean`) — Memtrace's name-keyed graph queries can't disambiguate without file/scope context.
2. Latency climbs 13× (166 ms → 2.2 s) because the graph is 13× larger and variable-length traversals scan more edges.

Generalization verified: graph queries work cross-corpus, and the structural-win story holds — ChromaDB fundamentally can't compete, and GitNexus's regex extractor still misses on name matching. Full details: `results/bench_3_full_django/rollup.md`.

## Bench #4 — Incremental / Staleness (2026-04-21)

50 deterministic edits (add/rename/move/delete) applied to a hand-authored 21-file `scratch_fixture` corpus (task-queue domain). For each edit we measure wall-time from file change to first successful re-query (`time_to_queryable_p95`, the primary axis; lower is better), plus a staleness probe that asks the adapter for the *old* symbol name after the edit and counts the cases where the pre-edit state still resolves.
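
A sketch of the per-edit loop just described (the adapter/edit API names are hypothetical; the real driver lives in `benches/bench_4_incremental/driver.py`):

```python
import time

def time_to_queryable_ms(adapter, edit, deadline_s: float = 5.0) -> float | None:
    """Wall-time from file change to first successful re-query, in ms."""
    edit.apply()                    # mutate the scratch_fixture corpus
    start = time.monotonic()
    adapter.reindex(edit.path)      # per-file / incremental reindex
    while time.monotonic() - start < deadline_s:
        if adapter.query(edit.new_symbol):  # first successful re-query
            return (time.monotonic() - start) * 1000.0
        time.sleep(0.01)
    return None  # deadline hit; surfaces as ChromaDB's 5,001 ms p95 below

def is_stale(adapter, edit) -> bool:
    """Staleness probe: does the old symbol name still resolve post-edit?"""
    return bool(adapter.query(edit.old_symbol))
```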

| Adapter  | Reindex supported | Queryable | t_queryable p95    | t_queryable p50 | Reindex p50 | Staleness | Errors |
|----------|-------------------|-----------|--------------------|-----------------|-------------|-----------|--------|
| memtrace | ✅                | 97.8%     | 42.5 ms            | 20.7 ms         | 1,168 ms    | 0.400     | 0      |
| cgc      | ✅                | 100.0%    | 613.7 ms           | 519.8 ms        | 1,127 ms    | 0.340     | 0      |
| chromadb | ✅                | 76.1%     | 5,001 ms (timeout) | 55.7 ms         | 66.8 ms     | 0.680     | 0      |
| gitnexus | `NotSupported`    | N/A       | N/A                | N/A             | N/A         | N/A       | 0      |

Memtrace wins `time_to_queryable_p95` — 42.5 ms vs CGC's 613.7 ms (14.4× faster) and ChromaDB's 5,001 ms deadline (~118× faster at p95). Memtrace's `index_directory(incremental=true, skip_embed=true)` re-parses only the changed files and their dependents; queries return within ~20 ms once the incremental job completes.

Honest-loss columns:

- ChromaDB re-embeds the chunks of touched files on re-index, but the symbol name rarely appears in the top-k vector-similar chunks (24% of edits never become queryable before the 5-second deadline). Staleness is 68% — the pre-edit chunks remain retrievable because renames inside a chunk don't invalidate neighbouring chunks.
- GitNexus's eval-server is batch-only, with no per-file reindex API. It returns `NotSupported` across all 50 edits, which is the correct answer: it cannot compete on incremental freshness.
- CGC's `cgc index <file>` works per-file, but each invocation reboots the FalkorDB pipeline (~1.1 s p50). Its 613.7 ms p95 is a solid second place.

Staleness caveat for Memtrace (40%): renames using `index_directory(incremental=true, clear_existing=false)` leave the pre-rename CodeNode in the graph — `find_symbol(old_name)` still returns the stale location until a full re-index drops the orphans. This is the honest trade for Memtrace's sub-50 ms freshness: incremental parse + merge doesn't invalidate the prior AST hash. An orphan-sweep pass on incremental reindex is the obvious next optimisation.

Reproduce:

```bash
benchmarks/.venv/bin/python -m benchmarks.suite.benches.bench_4_incremental.driver
cat benchmarks/suite/results/bench_4_full/rollup.md
```

Full per-edit jsonl + rollup: `results/bench_4_full/`.