Fix search_graph name_pattern= performance: regex cache, LIKE pre-filter, cheap count#300

Open
awconstable wants to merge 3 commits into DeusData:main from arbor-education:fix/254-search-graph-name-pattern-performance

@awconstable
Fixes #254

Root cause

Three compounding bugs caused name_pattern= searches to scan every node with an expensive compiled regex, regardless of how selective the pattern was:

  1. sqlite_iregexp / sqlite_regexp recompiled the regex on every row — cbm_regcomp + cbm_regfree fired once per node for the full table.
  2. The count query wrapped the full SELECT (including two correlated edge-count subqueries per row) in SELECT COUNT(*) FROM (...), doubling the scan with identical per-row overhead.
  3. cbm_extract_like_hints was implemented and correct but never called — the LIKE pre-filter that should cut the regex scan to only matching rows was dead code.

Changes

Fix 1 — regex cached per statement (sqlite_regexp / sqlite_iregexp)
Use sqlite3_get_auxdata / sqlite3_set_auxdata to cache the compiled cbm_regex_t for the lifetime of the statement. cbm_regcomp is now called exactly once per query, not once per row.
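
The compile-once pattern described above can be sketched without linking SQLite. This is a minimal illustration, not the PR's code: `sketch_aux_slot_t` is a hypothetical stand-in for the per-statement auxdata slot that `sqlite3_get_auxdata` / `sqlite3_set_auxdata` manage, POSIX `regex_t` stands in for `cbm_regex_t`, and every `sketch_*` name is invented for this example.

```c
#include <assert.h>
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for SQLite's per-statement auxdata slot. */
typedef struct {
    void *data;               /* cached object, NULL until first use */
    void (*destroy)(void *);  /* destructor run when the statement ends */
} sketch_aux_slot_t;

/* Counts regcomp() calls so the caching effect is observable. */
int sketch_compile_count = 0;

void sketch_free_regex(void *p)
{
    regfree((regex_t *)p);
    free(p);
}

/* Returns true when text matches pattern. The regex is compiled only
 * when the slot is empty, i.e. on the first row of a statement. */
bool sketch_regexp_cached(sketch_aux_slot_t *slot, const char *pattern,
                          const char *text)
{
    assert(slot != NULL && pattern != NULL && text != NULL);

    regex_t *re = slot->data;  /* plays the role of sqlite3_get_auxdata() */
    if (re == NULL) {
        re = malloc(sizeof *re);
        if (re == NULL)
            return false;
        if (regcomp(re, pattern, REG_EXTENDED | REG_NOSUB) != 0) {
            free(re);
            return false;
        }
        sketch_compile_count++;
        slot->data = re;       /* plays the role of sqlite3_set_auxdata() */
        slot->destroy = sketch_free_regex;
    }
    return regexec(re, text, 0, NULL, 0) == 0;
}

/* Plays the role of statement finalization: release the cached regex. */
void sketch_finalize(sketch_aux_slot_t *slot)
{
    assert(slot != NULL);
    if (slot->data != NULL && slot->destroy != NULL)
        slot->destroy(slot->data);
    slot->data = NULL;
    slot->destroy = NULL;
}
```

Calling `sketch_regexp_cached` for many rows with the same slot compiles the pattern once and frees it in `sketch_finalize`. In a real SQLite function the slot is owned by SQLite, which may discard auxdata at any time, so the callback has to be prepared to recompile whenever `sqlite3_get_auxdata` returns NULL.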

Fix 2 — LIKE pre-filter wired in (where_add_like_hints, search_where_basic)
Wire cbm_extract_like_hints into search_where_basic via a new where_add_like_hints helper. For .*Controller.* this prepends n.name LIKE '%Controller%'; the idx_nodes_name index satisfies the LIKE clause and only matching rows reach iregexp(). Added search_like_pool_t to manage the malloc'd LIKE strings across both statement executions. ST_SEARCH_MAX_BINDS raised 16 → 32.
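
To make the pre-filter idea concrete, here is a rough sketch of turning a simple regex into a LIKE hint. The actual signature and rules of `cbm_extract_like_hints` are not shown in this PR, so `sketch_like_hint` is a hypothetical helper with deliberately conservative behavior: it strips one leading and one trailing `.*` and emits a hint only when the remainder is purely alphanumeric or underscore.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: writes "%<literal>%" into out and returns true
 * when the pattern reduces to a plain literal; returns false (no hint)
 * for anything it does not understand, which is always safe because
 * the regex still runs on every row that passes the pre-filter. */
bool sketch_like_hint(const char *pattern, char *out, size_t out_size)
{
    assert(pattern != NULL && out != NULL);

    const char *start = pattern;
    size_t len = strlen(pattern);

    /* Drop one leading ".*" and one trailing ".*" if present. */
    if (len >= 2 && start[0] == '.' && start[1] == '*') {
        start += 2;
        len -= 2;
    }
    if (len >= 2 && start[len - 2] == '.' && start[len - 1] == '*')
        len -= 2;

    /* Need a non-empty literal and room for two '%' plus the NUL. */
    if (len == 0 || len + 3 > out_size)
        return false;

    /* Reject regex metacharacters. '_' is allowed: in LIKE it matches
     * any single character, which only widens the pre-filter. */
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)start[i];
        if (!isalnum(c) && c != '_')
            return false;
    }

    out[0] = '%';
    memcpy(out + 1, start, len);
    out[len + 1] = '%';
    out[len + 2] = '\0';
    return true;
}
```

The hint must match a superset of what the regex matches, which is why the sketch always wraps the literal in `%` on both sides and gives up on alternation, character classes, and escapes. How much of the speedup comes from index use versus the cheaper LIKE comparison depends on the query plan SQLite chooses; the sketch covers only the extraction step.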

Fix 3 — count query stripped of per-row edge subqueries
For the common no-degree-filter path, the count SQL is now SELECT COUNT(*) FROM nodes n WHERE <same WHERE> — no correlated edges subqueries. The degree-filter path retains the wrapped form since it needs those columns for the filter.
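
The two count-query shapes can be sketched as a small SQL builder. This is an illustration under assumptions, not the store's actual code: `sketch_build_count_sql` and its parameters are hypothetical, and the WHERE text is passed in as an already-built string.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical builder. Returns the SQL length, or -1 if buf is too
 * small. full_select is the complete search SELECT (with the in/out
 * degree subqueries); where_clause is the shared WHERE body. */
int sketch_build_count_sql(char *buf, size_t buf_size,
                           const char *full_select,
                           const char *where_clause,
                           bool has_degree_filter)
{
    assert(buf != NULL && full_select != NULL && where_clause != NULL);

    int n;
    if (has_degree_filter) {
        /* The filter references the degree columns, so keep the wrap. */
        n = snprintf(buf, buf_size, "SELECT COUNT(*) FROM (%s)", full_select);
    } else if (strlen(where_clause) == 0) {
        n = snprintf(buf, buf_size, "SELECT COUNT(*) FROM nodes n");
    } else {
        /* Cheap path: same WHERE, no correlated edge subqueries. */
        n = snprintf(buf, buf_size,
                     "SELECT COUNT(*) FROM nodes n WHERE %s", where_clause);
    }
    if (n < 0 || (size_t)n >= buf_size)
        return -1;
    return n;
}
```

Both forms must bind the same parameters in the same order as the main query's WHERE clause, otherwise the reported total and the returned page can disagree.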

Benchmark

Tested on a large PHP codebase (~200K nodes):

Query                                        Before    After   Speedup
name_pattern=.*Controller.*                  3099ms    508ms     6×
name_pattern=.*Service.*                     2006ms    506ms     4×
name_pattern=.*Repository.*                  2006ms    508ms     4×
name_pattern=specificFunctionName            1506ms    507ms     3×
label=Method + name_pattern=.*get.*          8509ms    509ms    17×

The ~500ms floor is cold-start I/O when spawning a fresh process against a ~500MB database. In the long-running MCP server (warm file cache) the query time is sub-millisecond.

A reusable benchmark script is included at scripts/benchmark-search-graph.sh.

Tests

All store search tests pass including store_search_pagination (offset-past-end total count), store_search_degree_filter, and the full store_extract_like_hints suite.

awconstable and others added 3 commits April 30, 2026 06:38
…ter, cheap count

Three compounding bugs caused 1.5–8.5s latency on name_pattern= searches against
large projects (216K nodes), now reduced to ~0ms query time (cold-start dominates):

Fix 1 — regex compiled once per statement, not once per row
  sqlite_regexp / sqlite_iregexp now use sqlite3_get_auxdata / sqlite3_set_auxdata
  to cache the compiled cbm_regex_t for the lifetime of the statement. Previously
  cbm_regcomp + cbm_regfree ran for every row scanned.

Fix 2 — LIKE pre-filter cuts rows reaching the regex
  Wire cbm_extract_like_hints (already implemented but dead) into search_where_basic
  via a new where_add_like_hints helper. For .*Controller.* this prepends
  n.name LIKE '%Controller%', letting the idx_nodes_name index satisfy the LIKE
  clause first and passing only matching rows to iregexp(). Added search_like_pool_t
  to manage the malloc'd LIKE strings across both statement executions.
  ST_SEARCH_MAX_BINDS raised 16 → 32 to accommodate extra bind slots.

Fix 3 — count query no longer runs per-row edge subqueries
  The count SQL previously wrapped the full SELECT (which includes two correlated
  subqueries for in_deg / out_deg) in SELECT COUNT(*) FROM (...), executing those
  edge counts for every matching row even though the count needs none of that.
  Non-degree-filter path now uses SELECT COUNT(*) FROM nodes n WHERE <same WHERE>,
  which has no per-row subqueries. Degree-filter path retains the wrapped form
  since it needs those columns for the filter.

Benchmark on home-ubuntu-dev-sis (216K nodes, 509MB DB):

  Query                                BEFORE    AFTER   speedup
  name_pattern=.*Controller.*          3099ms    508ms     6×
  name_pattern=.*Service.*             2006ms    506ms     4×
  name_pattern=.*Repository.*          2006ms    508ms     4×
  name_pattern=specificFuncName        1506ms    507ms     3×
  label=Method + name_pattern=.*get.*  8509ms    509ms    17×
  name_pattern=.*Approve.*             1506ms    507ms     3×
  name_pattern=.*authorize.*           1506ms    509ms     3×

The ~500ms floor is cold-start I/O (opening a 509MB file from disk). In the
long-running MCP server process the warm-cache query time is sub-millisecond.

All store search tests pass including pagination, degree filter, and extract_like_hints.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Make project a required CLI argument instead of a hardcoded name,
and remove internal query strings used during development testing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Flat BM25 queries of the form:
  SELECT ... FROM nodes_fts JOIN nodes WHERE MATCH ? AND project=? ORDER BY bm25() LIMIT N
block FTS5 WAND/MaxScore early-exit — the outer JOIN+WHERE is invisible to
the FTS5 planner, so it scores every matching document before any filter fires.
On a large codebase with 100K+ matches this causes 2–16 minute queries.

Fix: two-step subquery.  The inner FTS5-only query:
  SELECT rowid, bm25(nodes_fts) FROM nodes_fts WHERE MATCH ? ORDER BY bm25() LIMIT 2000
can early-terminate because no outer predicate blocks it.  The outer query
then joins and filters at most BM25_INNER_LIMIT (2000) candidates.

The count query uses the identical inner-limit subquery, so it benefits too.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
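
The two-step BM25 query from the last commit message can be written out as a single SQL string. This is a sketch under assumptions: the join key `n.id = c.rowid`, the `score` alias, and the helper name `sketch_build_bm25_sql` are illustrative and not taken from the diff; only `nodes_fts`, `nodes`, `project`, and `BM25_INNER_LIMIT` (2000) come from the message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define BM25_INNER_LIMIT 2000

/* Hypothetical builder. Returns the SQL length, or -1 if buf is too
 * small. Bind order: 1 = FTS5 MATCH text, 2 = project, 3 = page size. */
int sketch_build_bm25_sql(char *buf, size_t buf_size)
{
    assert(buf != NULL);

    int n = snprintf(buf, buf_size,
        /* Outer step: join and filter the capped candidate set. */
        "SELECT n.*, c.score "
        "FROM ("
            /* Inner step: FTS5 only, so the LIMIT applies before any
             * join or project filter is evaluated. */
            "SELECT rowid, bm25(nodes_fts) AS score "
            "FROM nodes_fts WHERE nodes_fts MATCH ? "
            "ORDER BY score LIMIT %d"
        ") AS c "
        "JOIN nodes n ON n.id = c.rowid "
        "WHERE n.project = ? "
        "ORDER BY c.score LIMIT ?",
        BM25_INNER_LIMIT);

    if (n < 0 || (size_t)n >= buf_size)
        return -1;
    return n;
}
```

One trade-off of this shape: because the project filter runs after the inner limit, a project whose matches all rank below the top 2000 candidates would return nothing, so the cap trades completeness for bounded latency.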
@DeusData added the bug (Something isn't working) and stability/performance (Server crashes, OOM, hangs, high CPU/memory) labels on May 4, 2026

Development

Successfully merging this pull request may close these issues.

search_graph on large datasets with name_pattern= is slow

2 participants