Date: 2026-05-01
Hardware: Apple M3 Max, 14 cores (10P + 4E), 36 GB RAM
Methodology: 1,000-row dataset × 3 query variants (exact, split, typo) = 3,000 query cases per adapter, against Python AST ground truth. Each tool indexes the same corpora once and serves queries from its own pipeline.
This doc covers the find_code (hybrid retrieval) tool — the path an agent uses for natural-language code search, NOT the find_symbol exact-lookup path covered in BENCHMARKS-v0.3.22.md. Both ship in the same binary; agents pick whichever fits the question.
find_code is a hybrid retrieval pipeline: BM25 (Tantivy fork, custom code tokenizer, per-field boosts) ⊕ semantic vector search (HNSW over 768-d code-specialised embeddings) ⊕ graph signal (callers count prior, exact-symbol order boost), fused via Reciprocal Rank Fusion, then re-ranked by a quantised ONNX cross-encoder (BAAI/bge-reranker-base, 75 MB) over a fixed 30-candidate pool.
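The fusion step above can be sketched as a minimal Reciprocal Rank Fusion over per-signal rankings. The symbol names and the `k=60` constant below are illustrative only; the pipeline's actual fusion constants are not published here:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical per-signal result lists (BM25, vector, graph).
bm25   = ["get_user_by_id", "UserCache.get", "get_user"]
vector = ["get_user", "get_user_by_id", "fetch_profile"]
graph  = ["get_user_by_id", "fetch_profile"]

pool = rrf_fuse([bm25, vector, graph])[:30]  # fixed 30-candidate rerank pool
# The cross-encoder then reranks `pool` and truncates to the caller's LIMIT.
```

A document that appears near the top of several lists ("get_user_by_id" here) outscores one that tops a single list, which is exactly the property that lets a weak-but-agreeing graph signal break BM25/vector ties.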
v0.3.29 ships two changes from v0.3.21 (the previous published baseline):
- The whole BM25/rerank stack landed: Tantivy fork with custom `B = 0.45`, code tokenizer, n-gram subword field, exact-name STRING field, per-field boosts (5/3/2/2/2/1), pseudo-relevance feedback, ANN filter pushdown, jina-embeddings-v2-base-code (768d), and the bge-reranker-base cross-encoder. See git history for the spec stack.
- Decouple the rerank pool from the caller's `LIMIT`. v0.3.29 always reranks a 30-candidate pool, then truncates to the requested `LIMIT`. So `find_code(limit=3)` returns the same top-3 a `find_code(limit=10)` call would — caller saves tokens without losing precision. See the `release(0.3.29)` commit for details.
Django corpus: pre-rerank baseline (v0_3_21) and post-rerank (v0.3.29), both at LIMIT=10. The GitNexus and ChromaDB columns are from the v0_3_21 run — they did not change between releases (no version bumps shipped). Same dataset, same machine, same query order across all three.
| Adapter | cov | acc@1 | acc@5 | acc@10 | MRR | tokens | p50 lat | p95 lat |
|---|---|---|---|---|---|---|---|---|
| memtrace v0.3.29 (rerank on, L=10) | 100.0% | 73.9% | 88.2% | 90.0% | 0.801 | 473 | 872 ms | 1012 ms |
| memtrace v0_3_21 (no rerank, L=10) | 100.0% | 48.9% | 84.0% | 88.5% | 0.632 | 1526 | 484 ms | 519 ms |
| gitnexus query (v0_3_21) | 98.6% | 38.6% | 69.9% | 72.8% | 0.518 | 200 | 850 ms | 1840 ms |
| chromadb vector (v0_3_21) | 99.4% | 28.9% | 48.9% | 53.3% | 0.372 | 1837 | 57 ms | 84 ms |
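Every headline quality metric in these tables reduces to the rank at which the gold symbol first appears per query. A minimal sketch of the scoring (not the harness's actual code):

```python
def score(gold_ranks, k_values=(1, 5, 10)):
    """gold_ranks: 1-based rank of the gold answer per query, None if absent.

    Returns acc@k for each k (fraction of queries whose gold rank <= k)
    and MRR (mean of 1/rank, counting misses as 0).
    """
    n = len(gold_ranks)
    acc = {k: sum(r is not None and r <= k for r in gold_ranks) / n
           for k in k_values}
    mrr = sum(1.0 / r for r in gold_ranks if r is not None) / n
    return acc, mrr

# Four hypothetical queries: ranks 1, 3, miss, 1.
acc, mrr = score([1, 3, None, 1])
# acc@1 = 0.5, acc@5 = acc@10 = 0.75, MRR = (1 + 1/3 + 0 + 1) / 4
```

Coverage ("cov" in the tables) is the separate fraction of queries for which the adapter returned any results at all.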
CGC is excluded from this matrix because its current public surface has graph + exact / full-text / substring search but no BM25 / vector / RRF retrieval path that maps onto find_code.
| metric | v0_3_21 | v0.3.29 | delta |
|---|---|---|---|
| acc@1 | 48.9% | 73.9% | +25.0 pts |
| acc@5 | 84.0% | 88.2% | +4.2 pts |
| acc@10 | 88.5% | 90.0% | +1.5 pts |
| MRR | 0.632 | 0.801 | +0.169 |
| avg tokens | 1526 | 473 | −1053 (−69%) |
| p50 latency | 484 ms | 872 ms | +388 ms |
The +388 ms p50 hit is the rerank inference cost. Small price for the +25 pt jump on top-1, the 3.2× token reduction, and the +1.5 pt recall lift.
The rerank earns its keep on natural-language and typo queries — exactly the kind of queries an agent is likely to phrase.
| variant | v0_3_21 acc@1 | v0.3.29 acc@1 | delta |
|---|---|---|---|
| exact (literal symbol name) | 79.7% | 77.5% | −2.2 pts |
| split (snake/camelCase split into tokens) | 31.1% | 70.8% | +39.7 pts |
| typo (single-character typo) | 35.8% | 73.4% | +37.6 pts |
exact regressed 2.2 pts — within noise for 1k queries (22 cases), and the rerank occasionally over-thinks a clean string match. split and typo are where the new stack shines, more than doubling the rate at which the right symbol comes back at rank 1. split is the closest variant to a real agent query ("the function that creates SQL test tables for postgres" rather than the exact identifier).
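The "within noise" call above can be sanity-checked with a two-proportion standard error. This is only a back-of-envelope sketch: it assumes roughly 1,000 independent queries per variant, whereas the real runs share one query set across versions, so it is approximate:

```python
import math

def diff_stderr(p1, p2, n=1000):
    """Standard error of the difference between two independent proportions,
    each measured over n trials."""
    return math.sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)

se = diff_stderr(0.797, 0.775)      # exact-variant acc@1, before vs after
z = (0.797 - 0.775) / se
# z is about 1.2, well under 1.96, so the 2.2 pt drop is not
# significant at the 95% level under these assumptions.
```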
Recall ceiling is also up across the board.
| variant | v0_3_21 acc@10 | v0.3.29 acc@10 | delta |
|---|---|---|---|
| exact | 92.4% | 93.4% | +1.0 pt |
| split | 86.0% | 86.5% | +0.5 pt |
| typo | 87.1% | 90.0% | +2.9 pts |
Mempalace corpus: same harness, same query-set construction. Memtrace v0.3.29 vs the v0_3_21 baseline vs published competitors.
| Adapter | cov | acc@1 | acc@5 | acc@10 | MRR | tokens | p50 lat | p95 lat |
|---|---|---|---|---|---|---|---|---|
| memtrace v0.3.29 (rerank on, L=10) | 100.0% | 93.8% | 99.3% | 99.7% | 0.961 | 419 | 447 ms | 554 ms |
| memtrace v0_3_21 (no rerank, L=10) | 100.0% | 44.2% | 92.6% | 98.7% | 0.625 | 1518 | 35 ms | 51 ms |
| chromadb vector (v0_3_21) | 100.0% | 59.7% | 82.9% | 85.5% | 0.695 | 1936 | 56 ms | 84 ms |
| gitnexus query (v0_3_21) | 100.0% | 11.7% | 79.8% | 94.6% | 0.346 | 357 | 390 ms | 950 ms |
v0.3.29 deltas vs v0_3_21 (memtrace-on-memtrace):
| metric | v0_3_21 | v0.3.29 | delta |
|---|---|---|---|
| acc@1 | 44.2% | 93.8% | +49.6 pts |
| acc@5 | 92.6% | 99.3% | +6.7 pts |
| acc@10 | 98.7% | 99.7% | +1.0 pt |
| MRR | 0.625 | 0.961 | +0.336 |
| avg tokens | 1518 | 419 | −1099 (−72%) |
| p50 latency | 35 ms | 447 ms | +412 ms |
The acc@1 lift is +49.6 pts on mempalace — bigger than Django (+25 pts) because the corpus is small enough that the right candidate is almost always in the rerank pool, and the cross-encoder just needs to surface it.
By query variant (acc@1):
| variant | v0_3_21 | v0.3.29 | delta |
|---|---|---|---|
| exact | 96.6% | 94.5% | −2.1 pts |
| split | 16.9% | 94.9% | +78.0 pts |
| typo | 19.0% | 91.9% | +72.9 pts |
Same pattern as Django, even more pronounced: exact drifts down ~2 pts (rerank over-thinks clean string matches), but split and typo go from broken (17–19%) to nearly perfect (92–95%). The agent-realistic workload — natural-language queries and typos — is where the rerank stack earns its keep.
Vs competitors (acc@1): memtrace v0.3.29 is 1.57× over ChromaDB (93.8% vs 59.7%) and 8.0× over GitNexus (93.8% vs 11.7%) on the same task. The pre-rerank v0_3_21 stack actually trailed ChromaDB on acc@1 (44.2% vs 59.7%); the v0.3.29 stack flips that decisively.
As with the Django matrix, CGC is excluded because its current public surface has no BM25 / vector / RRF retrieval path that maps onto find_code.
```sh
# Install Memtrace v0.3.29
npm install -g memtrace@0.3.29

# Index the corpora
memtrace index /path/to/django
memtrace index /path/to/mempalace

# Run the bench (Django)
ADAPTERS=memtrace \
MEMTRACE_RERANK=on \
REPO_ROOT=/path/to/django \
DATASET_FILE=benchmarks/fair/dataset_1k_django.json \
RESULTS_FILE=benchmarks/fair/results_hybrid_1k_django_v0_3_29.json \
LIMIT=10 \
MAX_QUERIES=1000 \
QUERY_VARIANTS=exact,split,typo \
python3 benchmarks/fair/run_hybrid_retrieval_benchmark.py
```

The published v0_3_21 baselines in `results_hybrid_1k_*_v0_3_21.json` were generated by the same script with `MEMTRACE_RERANK=off` (or unset) and an older Memtrace binary.
The full spec stack lives in the closed-source repo; the agent-visible diff is:

- Tantivy fork (`syncable-dev/tantivy@bm25-b-0.45`) — `B = 0.45` instead of upstream 0.75, calibrated against code-style identifier tokens. Plus the per-field boosts (NAME=5, SIG=3, SCOPE=2, KIND=2, LANG=2, CONTENT=1) tuned via a 6-row Sourcegraph-style sweep.
- Custom code tokenizer — splits camelCase / snake_case / kebab-case at query and index time, so `getUserById` matches `user_by_id`. Plus the n-gram subword field for typo recall.
- Exact-name STRING field with a separate boost path, so identifier exact matches don't lose to BM25 corpus stats on rare names.
- Pseudo-relevance feedback — when the primary search lands fewer than 3 results, expand the query with the top hit's `name` + `scope_path` and re-issue once. Lifts recall on short / generic queries without hurting precision.
- Caller-count prior (`1 + ln(1 + direct_callers)`) — log-compressed structural boost, so a function with 47 callers ranks above a same-named test fixture.
- Code-specialised embeddings — `jinaai/jina-embeddings-v2-base-code` (768d, code-tuned BERT). Replaced the old `bge-small-en-v1.5` (384d, English text). Override with `MEMTRACE_EMBED_MODEL=bge-small` if you want the legacy stack.
- Cross-encoder rerank — quantised ONNX `BAAI/bge-reranker-base` (75 MB) over the fused candidate pool. Default ON in v0.3.29 for agent traffic.
- Decouple rerank pool from `LIMIT` — v0.3.29's rerank always sees 30 candidates and truncates after, so `LIMIT=3` returns the same top-3 as `LIMIT=10` and callers can save tokens with no quality loss.
- `.claude/walker` exclude + `cleanup_stale_records` MCP tool — agent-worktree pollution fix and a targeted scrub for orphan records left behind by removed worktrees, branch deletes, and files deleted while Memtrace was stopped.
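The identifier splitting described above can be approximated with a small regex tokenizer. This is a sketch, not the fork's actual tokenizer; for instance, it does not handle all-caps runs like `HTTPServer`:

```python
import re

def split_identifier(name):
    """Split camelCase / snake_case / kebab-case into lowercase subwords."""
    # Insert a space at each lower-to-upper camelCase boundary...
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)
    # ...then split on spaces, underscores, and hyphens.
    return [t.lower() for t in re.split(r"[\s_\-]+", spaced) if t]

split_identifier("getUserById")  # ['get', 'user', 'by', 'id']
split_identifier("user_by_id")   # ['user', 'by', 'id']
```

Because both query and index sides emit the same subword tokens, `getUserById` and `user_by_id` reduce to the identical token stream and match each other in BM25 scoring.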
See the closed-source specs/bm25-improvements/ directory for the per-feature design docs and ablation evidence.