Skip to content

search_graph query= takes minutes on large codebases — FTS5 early termination blocked by outer JOIN #301

@awconstable

Description

@awconstable

Summary

search_graph with a query= (BM25 path) is catastrophically slow on large codebases. A 6-word query against a ~200K-node database takes 2–16 minutes per call.

Root cause

The current flat query:

SELECT ... FROM nodes_fts
JOIN nodes n ON n.id = nodes_fts.rowid
WHERE nodes_fts MATCH ?
  AND n.project = ?
  AND n.label NOT IN (...)
ORDER BY bm25(nodes_fts) LIMIT 20

blocks SQLite FTS5's WAND/MaxScore early-exit optimisation. FTS5 can short-circuit `ORDER BY bm25() LIMIT N` only when it controls the entire query plan. The outer `JOIN` + `WHERE n.project = ?` predicate is invisible to the FTS5 planner, so it must score every matching document before the outer filter can discard any of them. On a codebase with 100K+ matches for common terms, this is catastrophic.

The same problem affects the count query, which makes two full scans per `search_graph` call.

Reproduction

On a large codebase (~200K nodes, ~500MB database):

Query Time
`query=approve apps authorization school` ~18 000ms
`query=Group User Details Manage All Users` ~120 000ms
`query=dev portal approve integration third party` ~960 000ms

Fix

Two-step subquery: let FTS5 drive the inner query alone (no outer predicates), then join/filter the top-N candidates:

```sql
SELECT ...
FROM (
SELECT rowid, bm25(nodes_fts) AS base_rank
FROM nodes_fts WHERE nodes_fts MATCH ?
ORDER BY base_rank LIMIT 2000 -- FTS5 CAN early-terminate here
) fts
JOIN nodes n ON n.id = fts.rowid
WHERE n.project = ? AND n.label NOT IN (...)
ORDER BY rank LIMIT ? OFFSET ?
```

SQLite CAN early-terminate the inner plain FTS5 subquery because no outer predicate blocks it. The outer join/filter then works on at most 2000 rows.

The count query uses the identical inner-limit subquery structure, so it benefits from the same optimisation.

Expected result

Sub-second queries on warm cache; bounded by `BM25_INNER_LIMIT` (2000) candidates regardless of total match set size.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingstability/performanceServer crashes, OOM, hangs, high CPU/memory

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions