Fix search_graph name_pattern= performance: regex cache, LIKE pre-filter, cheap count #300
Open
awconstable wants to merge 3 commits into DeusData:main
…ter, cheap count

Three compounding bugs caused 1.5–8.5s latency on name_pattern= searches against large projects (216K nodes), now reduced to ~0ms query time (cold-start dominates):

Fix 1: regex compiled once per statement, not once per row

sqlite_regexp / sqlite_iregexp now use sqlite3_get_auxdata / sqlite3_set_auxdata to cache the compiled cbm_regex_t for the lifetime of the statement. Previously cbm_regcomp + cbm_regfree ran for every row scanned.

Fix 2: LIKE pre-filter cuts rows reaching the regex

Wire cbm_extract_like_hints (already implemented but dead) into search_where_basic via a new where_add_like_hints helper. For .*Controller.* this prepends n.name LIKE '%Controller%', letting the idx_nodes_name index satisfy the LIKE clause first and passing only matching rows to iregexp(). Added search_like_pool_t to manage the malloc'd LIKE strings across both statement executions. ST_SEARCH_MAX_BINDS raised 16 → 32 to accommodate extra bind slots.

Fix 3: count query no longer runs per-row edge subqueries

The count SQL previously wrapped the full SELECT (which includes two correlated subqueries for in_deg / out_deg) in SELECT COUNT(*) FROM (...), executing those edge counts for every matching row even though the count needs none of that. The non-degree-filter path now uses SELECT COUNT(*) FROM nodes n WHERE <same WHERE>, which has no per-row subqueries. The degree-filter path retains the wrapped form since it needs those columns for the filter.

Benchmark on home-ubuntu-dev-sis (216K nodes, 509MB DB):

Query                                    BEFORE   AFTER   speedup
name_pattern=.*Controller.*              3099ms   508ms   6×
name_pattern=.*Service.*                 2006ms   506ms   4×
name_pattern=.*Repository.*              2006ms   508ms   4×
name_pattern=specificFuncName            1506ms   507ms   3×
label=Method + name_pattern=.*get.*      8509ms   509ms   17×
name_pattern=.*Approve.*                 1506ms   507ms   3×
name_pattern=.*authorize.*               1506ms   509ms   3×

The ~500ms floor is cold-start I/O (opening a 509MB file from disk). In the long-running MCP server process the warm-cache query time is sub-millisecond.

All store search tests pass including pagination, degree filter, and extract_like_hints.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
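The compile-once idea behind Fix 1 can be sketched without SQLite at all. In the minimal sketch below, a plain struct stands in for the per-statement auxdata slot and POSIX regcomp stands in for cbm_regcomp; regex_slot_t, slot_get, slot_free, count_matches and compile_count are hypothetical names for illustration, not code from this PR.

```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the auxdata slot: owns one compiled regex
 * together with the pattern text and flags it was compiled from. */
typedef struct {
    int valid;      /* non-zero once `re` holds a compiled pattern */
    int icase;      /* flag the cached regex was compiled with */
    char *pattern;  /* owned copy of the pattern text */
    regex_t re;
} regex_slot_t;

static int compile_count = 0; /* demo instrumentation only */

/* Release whatever the slot currently holds. */
static void slot_free(regex_slot_t *slot) {
    if (slot->valid) {
        regfree(&slot->re);
        free(slot->pattern);
        slot->pattern = NULL;
        slot->valid = 0;
    }
}

/* Return a compiled regex for `pattern`, compiling only on a cache miss. */
static regex_t *slot_get(regex_slot_t *slot, const char *pattern, int icase) {
    assert(slot != NULL && pattern != NULL);
    if (slot->valid && slot->icase == icase &&
        strcmp(slot->pattern, pattern) == 0)
        return &slot->re;                  /* hit: reuse the compiled regex */

    slot_free(slot);                       /* miss: drop any stale entry */
    char *copy = malloc(strlen(pattern) + 1);
    if (copy == NULL) return NULL;
    strcpy(copy, pattern);

    int flags = REG_EXTENDED | REG_NOSUB | (icase ? REG_ICASE : 0);
    if (regcomp(&slot->re, pattern, flags) != 0) {
        free(copy);
        return NULL;
    }
    compile_count++;
    slot->pattern = copy;
    slot->icase = icase;
    slot->valid = 1;
    return &slot->re;
}

/* Simulates the per-row callback: one lookup per row, one compile in total. */
static int count_matches(regex_slot_t *slot, const char *pattern,
                         const char *const *names, int n) {
    int hits = 0;
    for (int i = 0; i < n; i++) {
        regex_t *re = slot_get(slot, pattern, 1);
        if (re != NULL && regexec(re, names[i], 0, NULL, 0) == 0) hits++;
    }
    return hits;
}
```

With a zero-initialised regex_slot_t, scanning any number of names against one pattern increments compile_count once; the per-row cost is then only the regexec call. In the real change the slot lives in SQLite's auxdata and is released by the destructor passed to sqlite3_set_auxdata.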
Make project a required CLI argument instead of a hardcoded name, and remove internal query strings used during development testing. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Flat BM25 queries of the form:

    SELECT ... FROM nodes_fts JOIN nodes WHERE MATCH ? AND project=? ORDER BY bm25() LIMIT N

block FTS5 WAND/MaxScore early-exit: the outer JOIN+WHERE is invisible to the FTS5 planner, so it scores every matching document before any filter fires. On a large codebase with 100K+ matches this causes 2–16 minute queries.

Fix: two-step subquery. The inner FTS5-only query:

    SELECT rowid, bm25(nodes_fts) FROM nodes_fts WHERE MATCH ? ORDER BY bm25() LIMIT 2000

can early-terminate because no outer predicate blocks it. The outer query then joins and filters at most BM25_INNER_LIMIT (2000) candidates. The count query uses the identical inner-limit subquery, so it benefits too.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
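As a rough illustration of the two-step shape described above, the sketch below assembles both the row query and the count query around the same capped inner subquery. The join condition (nodes.id = nodes_fts.rowid), the selected columns and the helper name build_bm25_sql are assumptions made for the example; only BM25_INNER_LIMIT and its value of 2000 come from the commit message.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define BM25_INNER_LIMIT 2000 /* value quoted in the commit message */

/* Build the two-step query into buf.
 * count_only != 0 yields the COUNT(*) variant over the same inner subquery.
 * Returns 0 on success, -1 if buf is too small.
 * Assumed schema (not confirmed by the source): nodes.id = nodes_fts.rowid,
 * and nodes has name and project columns. */
static int build_bm25_sql(char *buf, size_t buf_sz, int count_only) {
    assert(buf != NULL);
    const char *head = count_only ? "SELECT COUNT(*) "
                                  : "SELECT n.id, n.name, hits.score ";
    const char *tail = count_only ? "" : " ORDER BY hits.score LIMIT ?3";

    int n = snprintf(buf, buf_sz,
        "%s"
        /* Step 1: FTS5-only query, capped before any outer predicate. */
        "FROM (SELECT rowid, bm25(nodes_fts) AS score FROM nodes_fts "
        "WHERE nodes_fts MATCH ?1 ORDER BY bm25(nodes_fts) LIMIT %d) AS hits "
        /* Step 2: join and filter at most BM25_INNER_LIMIT candidates. */
        "JOIN nodes n ON n.id = hits.rowid "
        "WHERE n.project = ?2%s",
        head, BM25_INNER_LIMIT, tail);

    return (n < 0 || (size_t)n >= buf_sz) ? -1 : 0;
}
```

One consequence of this design worth keeping in mind: because the project filter runs after the inner cap, a project whose matches rank below the top 2000 overall can return fewer rows than a flat query would, so the cap trades completeness for latency.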
Fixes #254
Root cause

Three compounding bugs caused name_pattern= searches to scan every node with an expensive compiled regex, regardless of how selective the pattern was:

1. sqlite_iregexp / sqlite_regexp recompiled the regex on every row: cbm_regcomp + cbm_regfree fired once per node for the full table.
2. The count query wrapped the full SELECT in SELECT COUNT(*) FROM (...), doubling the scan with identical per-row overhead.
3. cbm_extract_like_hints was implemented and correct but never called: the LIKE pre-filter that should cut the regex scan to only matching rows was dead code.

Changes
Fix 1: regex cached per statement (sqlite_regexp / sqlite_iregexp)

Use sqlite3_get_auxdata / sqlite3_set_auxdata to cache the compiled cbm_regex_t for the lifetime of the statement. cbm_regcomp is now called once per statement, not once per row.

Fix 2: LIKE pre-filter wired in (where_add_like_hints, search_where_basic)

Wire cbm_extract_like_hints into search_where_basic via a new where_add_like_hints helper. For .*Controller.* this prepends n.name LIKE '%Controller%'; the idx_nodes_name index satisfies the LIKE clause and only matching rows reach iregexp(). Added search_like_pool_t to manage the malloc'd LIKE strings across both statement executions. ST_SEARCH_MAX_BINDS raised 16 → 32.

Fix 3: count query stripped of per-row edge subqueries

For the common no-degree-filter path, the count SQL is now SELECT COUNT(*) FROM nodes n WHERE <same WHERE>, with no correlated edges subqueries. The degree-filter path retains the wrapped form since it needs those columns for the filter.

Benchmark
Tested on a large PHP codebase (~200K nodes) with these queries:

- name_pattern=.*Controller.*
- name_pattern=.*Service.*
- name_pattern=.*Repository.*
- name_pattern=specificFunctionName
- label=Method + name_pattern=.*get.*

The ~500ms floor is cold-start I/O when spawning a fresh process against a ~500MB database. In the long-running MCP server (warm file cache) the query time is sub-millisecond.
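For reference, the count-query split described under Fix 3 can be sketched as a small SQL builder. This is an illustration under assumptions, not the PR's code: build_count_sql and ROW_SELECT are hypothetical names, and the edges column names (source_id, target_id) are guesses at the schema.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical row query with the two correlated degree subqueries.
 * The edges column names are assumptions, not taken from the source. */
static const char *ROW_SELECT =
    "SELECT n.id, n.name, "
    "(SELECT COUNT(*) FROM edges e WHERE e.target_id = n.id) AS in_deg, "
    "(SELECT COUNT(*) FROM edges e WHERE e.source_id = n.id) AS out_deg "
    "FROM nodes n";

/* Build the count query into buf.
 * where:         WHERE body shared with the row query (placeholders only,
 *                never raw user input).
 * degree_filter: NULL for the common path, or a predicate on in_deg/out_deg.
 * Returns 0 on success, -1 if buf is too small. */
static int build_count_sql(char *buf, size_t buf_sz,
                           const char *where, const char *degree_filter) {
    assert(buf != NULL && where != NULL);
    int n;
    if (degree_filter == NULL) {
        /* Cheap path: count straight off nodes, no per-row edge subqueries. */
        n = snprintf(buf, buf_sz,
                     "SELECT COUNT(*) FROM nodes n WHERE %s", where);
    } else {
        /* Degree path: the filter needs in_deg/out_deg, so keep the wrap. */
        n = snprintf(buf, buf_sz,
                     "SELECT COUNT(*) FROM (%s WHERE %s) WHERE %s",
                     ROW_SELECT, where, degree_filter);
    }
    return (n < 0 || (size_t)n >= buf_sz) ? -1 : 0;
}
```

The point of the split is that the cheap path never mentions edges, so the count cost scales with the number of nodes that pass the WHERE clause rather than with their edge fan-in and fan-out.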
A reusable benchmark script is included at scripts/benchmark-search-graph.sh.

Tests

All store search tests pass including store_search_pagination (offset-past-end total count), store_search_degree_filter, and the full store_extract_like_hints suite.
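To make the behaviour that a like-hints suite would exercise more concrete, here is a deliberately conservative sketch of deriving a LIKE pre-filter from a regex. It is not the implementation of cbm_extract_like_hints; like_hint_from_regex and MIN_HINT_LEN are hypothetical, and the rule used (longest run of mandatory alphanumeric literals, bailing out on alternation, groups, classes, bounds and escapes) is an assumption chosen so the LIKE pattern always matches a superset of what the regex matches.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

#define MIN_HINT_LEN 3 /* assumed threshold: shorter hints filter too little */

/* Write "%literal%" into out for the longest mandatory literal run in re.
 * Returns 1 if a hint was produced, 0 if no safe hint exists. */
static int like_hint_from_regex(const char *re, char *out, size_t out_sz) {
    assert(re != NULL && out != NULL);

    /* Give up on constructs where a literal may not be required. */
    if (strpbrk(re, "|()[]{}\\") != NULL) return 0;

    size_t best_start = 0, best_len = 0;
    size_t run_start = 0, run_len = 0;
    size_t n = strlen(re);

    for (size_t i = 0; i < n; i++) {
        unsigned char c = (unsigned char)re[i];
        char next = re[i + 1]; /* '\0' once i is the last index */
        int literal = isalnum(c) != 0;
        /* "x*" and "x?" make x optional, so x cannot be part of a hint. */
        int optional = literal && (next == '*' || next == '?');

        if (literal && !optional) {
            if (run_len == 0) run_start = i;
            run_len++;
            if (run_len > best_len) {
                best_start = run_start;
                best_len = run_len;
            }
            /* "x+" requires x once but may repeat it: end the run here. */
            if (next == '+') run_len = 0;
        } else {
            run_len = 0;
        }
    }

    /* Need room for '%' + literal + '%' + NUL. */
    if (best_len < MIN_HINT_LEN || best_len + 3 > out_sz) return 0;
    out[0] = '%';
    memcpy(out + 1, re + best_start, best_len);
    out[best_len + 1] = '%';
    out[best_len + 2] = '\0';
    return 1;
}
```

Under these rules .*Controller.* yields %Controller%, while a pattern such as foo|bar yields no hint and falls back to the regex alone. Since SQLite's default LIKE is case-insensitive for ASCII, the same hint is safe in front of both the case-sensitive and case-insensitive regex functions.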