
Commit 286d8ec

update docs and readme, additional cleanup when possible -> use native sqlglot features whenever possible
1 parent 86ed996 commit 286d8ec

8 files changed

Lines changed: 285 additions & 288 deletions

File tree

AGENTS.md

Lines changed: 30 additions & 0 deletions
@@ -140,6 +140,36 @@ poetry run ruff check sql_metadata # linting
 - Enabled rule sets: E, F, W (pycodestyle/pyflakes), C90 (mccabe), I (isort)
 - Exceptions: Use `# noqa: C901` for complex but necessary functions

+## Review Practices
+
+### Verify before grading severity
+
+When reviewing code (or producing a critical review of a branch/PR), **spike
+every claim before attaching a severity or a "~N LoC removable" number**:
+
+1. Read the tests that cover the code path you're flagging — they encode
+the actual contract you'd be changing.
+2. If the claim is "library X already handles this", actually run X against
+a handful of real inputs from the codebase and confirm the output shape
+matches what downstream code consumes.
+3. If the claim is "N lines removable", sketch the replacement and see
+whether tests still pass — mentally or via a throwaway branch.
+4. Only verified claims deserve HIGH severity or concrete LoC numbers.
+Unverified hunches belong in a "needs investigation" list, not a
+severity-ranked review.
+
+Estimates without verification give false authority to findings that may
+not hold up. In a past v3 review four HIGH/MEDIUM items (comment
+extraction, scope-based resolution, LIMIT regex, "god class" LoC grade)
+dissolved within minutes of actual investigation; all four would have
+been caught by a pre-grade spike.
+
+Before closing a review phase, re-read every HIGH and MEDIUM finding and
+confirm a verification step exists in the session transcript for each
+one. If a spike did not happen, downgrade or drop the finding before
+publishing. Codifying the rule in memory is not enough — it has to be
+applied *before* the claim is formed, not consulted afterwards.
+
 ## Error Handling Patterns

 ### Malformed SQL Detection

ARCHITECTURE.md

Lines changed: 46 additions & 35 deletions
@@ -17,6 +17,7 @@ sql-metadata v3 is a Python library that parses SQL queries and extracts metadat
 | [`comments.py`](sql_metadata/comments.py) | Comment extraction/stripping via tokenizer gaps | `extract_comments`, `strip_comments` |
 | [`keywords_lists.py`](sql_metadata/keywords_lists.py) | `QueryType` enum ||
 | [`utils.py`](sql_metadata/utils.py) | `UniqueList` (deduplicating list), `last_segment`, `DOT_PLACEHOLDER` ||
+| [`exceptions.py`](sql_metadata/exceptions.py) | Custom exception hierarchy | `InvalidQueryDefinition` |
 | [`generalizator.py`](sql_metadata/generalizator.py) | Query anonymisation for log aggregation | `Generalizator` |

 ---
@@ -126,7 +127,7 @@ def tables(self) -> List[str]:
     return self._tables
 ```

-**Regex fallbacks** — when `sqlglot.parse()` fails (raises `ValueError`), the parser falls back to regex extraction for columns (`_extract_columns_regex`) and LIMIT/OFFSET (`_extract_limit_regex`) rather than raising an error.
+**Regex fallbacks** — when `sqlglot.parse()` fails (raises `InvalidQueryDefinition`), the parser falls back to regex extraction for columns (`_extract_columns_regex`) and LIMIT/OFFSET (`_extract_limit_regex`) rather than propagating the error. `InvalidQueryDefinition` is a `ValueError` subclass defined in [`exceptions.py`](sql_metadata/exceptions.py) — catching `ValueError` still works for external callers.

 ---

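As an editorial aside, the subclass-plus-fallback pattern described in that hunk can be sketched in isolation. This is an illustration, not the library's code — `parse_limit`, `limit_with_fallback`, and the regex are invented for the example; only the `ValueError`-subclass trick is taken from the text above.

```python
import re

class InvalidQueryDefinition(ValueError):
    """Domain-specific error that still satisfies `except ValueError:`."""

def parse_limit(sql):
    # Stand-in for the AST parse: pretend it always rejects the query.
    raise InvalidQueryDefinition("Could not parse the query")

def limit_with_fallback(sql):
    try:
        return parse_limit(sql)
    except InvalidQueryDefinition:
        # Fall back to a regex scan instead of propagating the error,
        # mirroring the `_extract_limit_regex` behaviour described above.
        match = re.search(r"\bLIMIT\s+(\d+)", sql, re.IGNORECASE)
        return int(match.group(1)) if match else None

print(limit_with_fallback("SELECT * FROM t LIMIT 10"))  # → 10
```

Because the exception inherits from `ValueError`, a caller written against the old API (`except ValueError:`) keeps working unchanged.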
@@ -204,7 +205,7 @@ flowchart TD
 1. Parse with `sqlglot.parse()` (warnings suppressed)
 2. Check for degradation via `_is_degraded` — phantom tables (`IGNORE`, `""`), keyword-as-column names (`UNIQUE`, `DISTINCT`)
 3. If degraded and not the last dialect, try the next one
-4. If all fail, raise `ValueError("This query is wrong")`
+4. If all fail, raise `InvalidQueryDefinition` (a `ValueError` subclass from [`exceptions.py`](sql_metadata/exceptions.py))

 ---

@@ -237,22 +238,24 @@ flowchart TB

 #### DFS dispatch

-The walk visits each node and dispatches to specialised handlers:
+The walk visits each node and routes it through `_dispatch_leaf`, which calls a specialised handler or inline branch depending on the node type:

-| AST Node Type | Handler | What it does |
+| AST Node Type | Routing | What happens |
 |---------------|---------|-------------|
-| `exp.Star` | `_handle_star` | Adds `*` (skips if inside function like `COUNT(*)`) |
-| `exp.ColumnDef` | (inline) | Adds column name for CREATE TABLE DDL |
-| `exp.Identifier` | `_handle_identifier` | Adds column if in JOIN USING context |
+| `exp.Star` | inline in `_dispatch_leaf` | Adds `*` (skips if inside a function like `COUNT(*)`) |
+| `exp.ColumnDef` | inline in `_dispatch_leaf` | Adds column name for CREATE TABLE DDL |
+| `exp.Identifier` | inline in `_dispatch_leaf` | Adds column if in JOIN USING context |
 | `exp.CTE` | `_handle_cte` | Records CTE name, processes column definitions |
 | `exp.Column` | `_handle_column` | Main handler — resolves table alias, builds full name |
-| `exp.Subquery` (aliased) | (inline) | Records subquery name and depth for ordering |
+| `exp.Subquery` (aliased) | inline in `_dispatch_leaf` | Records subquery name and depth for ordering |

 **Special processing** in `_process_child_key`:
 - SELECT expressions → `_handle_select_exprs` → iterates expressions, detects aliases
 - INSERT schema → `_handle_insert_schema` → extracts column list from `INSERT INTO t(col1, col2)`
 - JOIN USING → `_handle_join_using` → extracts column identifiers

+**Error handling** — `_handle_cte` raises `InvalidQueryDefinition` if a `WITH` clause contains an alias-less CTE (invalid SQL).
+
 #### Clause classification

 `_classify_clause` maps each `arg_types` key to a `columns_dict` section:
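The `_dispatch_leaf`-style routing in the table above reduces to dispatch on node type. A toy sketch (editorial — the node classes here are invented stand-ins for sqlglot's `exp.*` types, not its real API):

```python
from dataclasses import dataclass

class Star:            # stands in for exp.Star
    pass

class Identifier:      # stands in for exp.Identifier (JOIN USING context)
    def __init__(self, name):
        self.name = name

@dataclass
class Column:          # stands in for exp.Column
    name: str
    table: str = ""

def dispatch_leaf(node, columns):
    # Route each visited node by type, mirroring the table above:
    # some node types get an inline branch, others a dedicated handler.
    if isinstance(node, Star):
        columns.append("*")
    elif isinstance(node, Column):
        # "Main handler" case: build the table-qualified name if possible.
        columns.append(f"{node.table}.{node.name}" if node.table else node.name)
    elif isinstance(node, Identifier):
        columns.append(node.name)

cols = []
for node in (Column("id", "u"), Star(), Identifier("user_id")):
    dispatch_leaf(node, cols)
print(cols)  # → ['u.id', '*', 'user_id']
```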
@@ -285,31 +288,30 @@ The walk visits each node and dispatches to specialised handlers:

 **File:** [`table_extractor.py`](sql_metadata/table_extractor.py) | **Class:** `TableExtractor`

-Walks the AST for `exp.Table` and `exp.Lateral` nodes, builds fully-qualified table names, and sorts results by first occurrence in the raw SQL.
+Walks the AST for `exp.Table` nodes, builds fully-qualified table names, and sorts results by each table identifier's character position recorded by sqlglot's tokenizer.

 #### Extraction flow

 ```mermaid
 flowchart TB
-AST["sqlglot AST"] --> CHECK{"exp.Command?"}
-CHECK -->|Yes| REGEX["Regex fallback\n(_extract_tables_from_command)"]
-CHECK -->|No| CREATE{"exp.Create?"}
-CREATE -->|Yes| TARGET["Extract CREATE target"]
+AST["sqlglot AST"] --> CREATE{"exp.Create?"}
+CREATE -->|Yes| TARGET["_extract_create_target()\nTarget goes first"]
 CREATE -->|No| SKIP["skip"]
 TARGET --> COLLECT
-SKIP --> COLLECT["_collect_all()\nWalk exp.Table + exp.Lateral"]
+SKIP --> COLLECT["_table_nodes()\nfind_all(exp.Table), cached"]
 COLLECT --> FILTER["Filter out CTE names"]
-FILTER --> SORT["Sort by _first_position()\n(regex in raw SQL)"]
-SORT --> ORDER["_place_tables_in_order()\nCREATE target goes first"]
+FILTER --> SORT["Sort by _table_start_position()\n(Identifier.meta['start'])"]
+SORT --> FINAL["UniqueList:\nCREATE target + sorted tables"]
 ```

 **Key algorithms:**

-- **Name construction** — `_table_full_name` assembles `catalog.db.name`, with special handling for bracket mode (TSQL) and double-dot notation (`catalog..name`)
-- **Position sorting** — `_first_position` finds each table name in the raw SQL via regex, preferring matches after table-introducing keywords (`FROM`, `JOIN`, `TABLE`, `INTO`, `UPDATE`). This ensures output order matches left-to-right reading order.
-- **CTE filtering** — table names matching known CTE names are excluded, so only real tables appear in the output
+- **Name construction** — `_table_full_name` assembles `catalog.db.name`, with special handling for bracket mode (TSQL, via `_bracketed_full_name`) and double-dot notation (`catalog..name`, detected by `db == ""` in the AST).
+- **Position sorting** — `_table_start_position` reads each table identifier's character offset from sqlglot's tokenizer (`Identifier.meta['start']`). No regex scan of the raw SQL is needed — the AST already carries source positions.
+- **CTE filtering** — table names matching known CTE names are excluded, so only real tables appear in the output.
+- **CREATE target placement** — for `CREATE TABLE ... AS SELECT` statements, the target table is extracted via `_extract_create_target` and prepended to the result regardless of its source position.
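The ordering logic those bullets describe can be illustrated with a small sketch (editorial — the integer offsets stand in for the `Identifier.meta['start']` values mentioned above, and `order_tables` is an invented helper name):

```python
def order_tables(create_target, tables_with_offsets):
    """Sort tables by source offset, dedupe, and put the CREATE target first."""
    ordered = sorted(tables_with_offsets, key=lambda pair: pair[1])
    result = [create_target] if create_target else []
    for name, _offset in ordered:
        if name not in result:   # UniqueList-style dedup, order preserved
            result.append(name)
    return result

# CREATE TABLE report AS SELECT ... FROM orders JOIN users ...
print(order_tables("report", [("users", 52), ("orders", 40)]))
# → ['report', 'orders', 'users']
```

The CREATE target is prepended unconditionally, so it wins even when its identifier appears later in the raw SQL than the source tables.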

-**Alias extraction** — `extract_aliases` walks `exp.Table` nodes looking for aliases:
+**Alias extraction** — `extract_aliases(tables)` walks the cached `exp.Table` nodes looking for aliases, keeping only those whose fully-qualified name appears in *tables*:

 ```sql
 SELECT * FROM users u JOIN orders o ON u.id = o.user_id
@@ -324,30 +326,30 @@ SELECT * FROM users u JOIN orders o ON u.id = o.user_id

 **File:** [`nested_resolver.py`](sql_metadata/nested_resolver.py) | **Class:** `NestedResolver`

-Handles the complete "look inside nested queries" concern. Created lazily by `Parser._get_resolver()`.
+Handles the complete "look inside nested queries" concern. Created lazily by `Parser._get_resolver()`, which passes the `Parser` class itself as a `parser_factory` callable (dependency injection) so the resolver can instantiate sub-parsers without importing `Parser` at module load time.

 #### Four responsibilities

 **1. Name extraction** — extract CTE and subquery names from the AST:

-- `extract_cte_names(ast, cte_name_map)` — static method, walks `exp.CTE` nodes and collects their aliases (with reverse CTE name map applied)
-- `extract_subquery_names(ast)` — static method, post-order walk collecting aliased `exp.Subquery` names
+- `extract_cte_names(cte_name_map)` — instance method, walks `exp.CTE` nodes and collects their aliases (with the reverse CTE name map applied to restore dots that `SqlCleaner` replaced with `__DOT__`).
+- `extract_subqueries(ast)` — static method, single post-order walk that returns `(names, bodies)` together. Innermost subqueries appear first. Aliased subqueries keep their alias; unaliased ones get synthetic `subquery_N` names.

 Called directly by `Parser.with_names` and `Parser.subqueries_names`.

 **2. Body extraction** — render CTE/subquery AST nodes back to SQL:

-- `extract_cte_bodies` — finds `exp.CTE` nodes in the AST, renders their body via `_PreservingGenerator`
-- `extract_subquery_bodies` — post-order walk so inner subqueries appear before outer ones
-- `_PreservingGenerator` — custom sqlglot `Generator` that preserves function signatures sqlglot would normalise (e.g., keeps `IFNULL` instead of converting to `COALESCE`, keeps `DIV` instead of `CAST(... / ... AS INT)`)
+- `extract_cte_bodies(cte_name_map)` — finds `exp.CTE` nodes in the AST and renders each body via `_PreservingGenerator`.
+- Subquery bodies are produced alongside their names by `extract_subqueries` — no separate body-extraction method.
+- `_PreservingGenerator` — custom sqlglot `Generator` that preserves function signatures sqlglot would normalise: keeps `IFNULL` instead of rewriting to `COALESCE`, keeps `DIV` instead of `CAST(... / ... AS INT)`, renders `DATE_ADD`/`DATE_SUB`, and preserves `IS NOT NULL` / `NOT IN` idioms.

 **3. Column resolution** — `resolve()` runs two phases:

 ```mermaid
 flowchart TB
 INPUT["columns from ColumnExtractor"]
 INPUT --> P1["Phase 1: _resolve_sub_queries()\nReplace subquery.column refs\nwith actual columns"]
-P1 --> P2["Phase 2: _resolve_bare_through_nested()\nDrop bare names that are\naliases in nested queries"]
+P1 --> P2["Phase 2: _resolve_unqualified_through_nested()\nDrop bare names that are\naliases in nested queries"]
 P2 --> OUTPUT["Resolved columns"]
 ```

@@ -364,11 +366,11 @@ SELECT label FROM cte
 -- "label" is an alias inside the CTE → dropped from columns, added to aliases
 ```

-**4. Recursive sub-Parser instantiation** — when resolving `subquery.column`, the resolver creates a new `Parser(body_sql)` for each nested query body (cached in `_subqueries_parsers` / `_with_parsers`). This means the full pipeline runs recursively for each CTE/subquery.
+**4. Recursive sub-Parser instantiation** — when resolving `subquery.column`, the resolver invokes `self._parser_factory(body_sql)` to build a new `Parser` for each nested body (cached in `_subqueries_parsers` / `_with_parsers`). The full pipeline runs recursively for each CTE/subquery, but the dependency is injected rather than imported.

 #### Alias resolution with cycle detection

-`_resolve_column_alias` follows alias chains with a `visited` set to prevent infinite loops:
+`resolve_column_alias` (public) and its private helper `_resolve_column_alias` follow alias chains with a `visited` set to prevent infinite loops:
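A cycle-safe alias-chain walk of that shape might look like the following sketch (editorial — it assumes a plain `dict` of alias→target mappings rather than the library's internal state):

```python
def resolve_column_alias(alias, alias_map):
    """Follow alias -> target links; the visited set breaks cycles."""
    visited = set()
    current = alias
    while current in alias_map and current not in visited:
        visited.add(current)
        current = alias_map[current]
    return current

print(resolve_column_alias("a", {"a": "b", "b": "c"}))  # → c
# A pathological chain a -> b -> a terminates instead of spinning forever:
print(resolve_column_alias("a", {"a": "b", "b": "a"}))  # → a
```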

 ```python
 # a → b → c (resolves to "c")
@@ -396,9 +398,10 @@ Maps the AST root node type to a `QueryType` enum value via `_SIMPLE_TYPE_MAP`:
 | `exp.Merge` | `MERGE` |

 Special handling:
-- Parenthesised queries → `_unwrap_parens` strips `Paren`/`Subquery` wrappers
-- `exp.Command` — `_resolve_command_type` checks for `CREATE FUNCTION` / `ALTER`
-- `REPLACE INTO` → detected via `ASTParser.is_replace` flag, patched in `Parser.query_type`
+- A bare `exp.With` root (a `WITH` clause with no main statement) raises `InvalidQueryDefinition` — it is not valid SQL on its own.
+- `exp.Command` — `_resolve_command_type` inspects the command's `this` attribute and maps `CREATE` back to `QueryType.CREATE` so dialect-specific DDL that degrades to an opaque command still returns a useful type.
+- `REPLACE INTO` — `Parser` forwards the `ASTParser.is_replace` flag into the extractor's constructor; when the AST is `exp.Insert` and `is_replace` is true, the extractor returns `QueryType.REPLACE` directly.
+- Empty / comment-only SQL → `_raise_for_none_ast` distinguishes "no parseable content" (`"Empty queries are not supported!"`) from "had content but sqlglot produced no AST" (`"Could not parse the query — the SQL syntax appears to be invalid"`), both raised as `InvalidQueryDefinition`.

 ---
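A `_SIMPLE_TYPE_MAP`-style lookup with an `is_replace` override can be sketched as follows (editorial — the `Select`/`Insert`/`Merge` classes are invented stand-ins for sqlglot expression types, and the enum is trimmed to a few members):

```python
from enum import Enum

class QueryType(Enum):
    SELECT = "SELECT"
    INSERT = "INSERT"
    MERGE = "MERGE"
    REPLACE = "REPLACE"

class Select: pass   # stand-in for exp.Select
class Insert: pass   # stand-in for exp.Insert
class Merge: pass    # stand-in for exp.Merge

# Root node type -> query type, mirroring the table above.
_SIMPLE_TYPE_MAP = {
    Select: QueryType.SELECT,
    Insert: QueryType.INSERT,
    Merge: QueryType.MERGE,
}

def query_type(root, is_replace=False):
    # REPLACE INTO parses as an Insert node; the forwarded flag flips the answer.
    if isinstance(root, Insert) and is_replace:
        return QueryType.REPLACE
    return _SIMPLE_TYPE_MAP[type(root)]

print(query_type(Merge()).value)         # → MERGE
print(query_type(Insert(), True).value)  # → REPLACE
```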

@@ -435,6 +438,9 @@ A collection of pure stateless functions (no class). Exploits the fact that sqlg
 - `last_segment` — returns the last dot-separated segment of a qualified name (e.g. ``"schema.table.column"`` → ``"column"``).
 - `DOT_PLACEHOLDER` — encoding constant for qualified CTE names (``__DOT__``).
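The two helpers named above are small enough to sketch in full — an editorial reconstruction assuming only the semantics the bullets describe, not the module's actual implementation:

```python
class UniqueList(list):
    """A list that drops duplicate appends while preserving insertion order."""
    def append(self, item):
        if item not in self:
            super().append(item)

def last_segment(name):
    """Return the last dot-separated segment of a qualified name."""
    return name.rsplit(".", 1)[-1]

tables = UniqueList()
for t in ["users", "orders", "users"]:
    tables.append(t)
print(tables)                               # → ['users', 'orders']
print(last_segment("schema.table.column"))  # → column
```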

+**[`exceptions.py`](sql_metadata/exceptions.py):**
+- `InvalidQueryDefinition` — a `ValueError` subclass raised whenever the SQL is structurally invalid (empty, unparseable, unsupported query type, alias-less CTE, or all dialects degraded). Inheriting from `ValueError` keeps existing `except ValueError:` handlers working while giving callers a specific type to catch.
+
 **[`generalizator.py`](sql_metadata/generalizator.py)** — anonymises SQL for log aggregation: strips comments, replaces literals with `X`, numbers with `N`, collapses `IN(...)` lists to `(XYZ)`.

 ---
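The anonymisation rules listed for `generalizator.py` can be approximated with three regex passes — an editorial sketch of the idea, not the `Generalizator` class itself (comment stripping is omitted, and the patterns are deliberately naive):

```python
import re

def generalize(sql):
    # Collapse IN(...) lists first, before their numbers become N's.
    sql = re.sub(r"(?i)\bIN\s*\([^)]*\)", "IN (XYZ)", sql)
    sql = re.sub(r"'[^']*'", "X", sql)   # string literals -> X
    sql = re.sub(r"\b\d+\b", "N", sql)   # numbers -> N
    return sql

print(generalize("SELECT * FROM t WHERE id = 123 AND name = 'bob'"))
# → SELECT * FROM t WHERE id = N AND name = X
```

Queries that differ only in their literal values now collapse to one canonical string, which is what makes log aggregation work.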
@@ -514,6 +520,7 @@ sequenceDiagram
 flowchart TB
 INIT["__init__.py"]
 INIT --> P["parser.py"]
+INIT --> EXC["exceptions.py"]

 P --> AST["ast_parser.py"]
 P --> EXT["column_extractor.py"]
@@ -529,25 +536,29 @@ flowchart TB
 AST --> DP["dialect_parser.py"]

 SC --> COM
+SC --> EXC
 DP --> COM
+DP --> EXC
 DP -.->|"sqlglot.parse()"| SG["sqlglot"]
 TAB --> DP

 EXT -.-> SG
 EXT --> UT
+EXT --> EXC
 TAB -.-> SG
 RES -.-> SG
 RES --> UT
-RES -->|"sub-Parser\n(recursive)"| P
+RES -.->|"parser_factory\n(injected by Parser)"| P
 QT -.-> SG
 QT --> KW
+QT --> EXC
 COM -.->|"Tokenizer"| SG
 GEN --> COM

 style SG fill:#f0f0f0,stroke:#999
 ```

-Note the circular dependency: `nested_resolver.py` imports `Parser` from `parser.py` to create sub-Parser instances for nested queries. This import is deferred (inside method bodies) to avoid import-time cycles.
+`nested_resolver.py` needs `Parser` to recursively analyse CTE/subquery bodies, but importing `Parser` at module load would create a cycle (`parser.py` already imports `NestedResolver`). Instead, `Parser._get_resolver()` passes the `Parser` class itself into `NestedResolver.__init__` as a `parser_factory` callable — pure dependency injection. The only `parser.py` reference in `nested_resolver.py` is a `TYPE_CHECKING`-guarded import for type hints.
552563
---
553564

@@ -563,4 +574,4 @@ Note the circular dependency: `nested_resolver.py` imports `Parser` from `parser
563574

564575
**Graceful regex fallbacks** — when the AST parse fails entirely, the parser degrades to regex-based extraction for columns (INSERT INTO pattern) and LIMIT/OFFSET rather than raising an error.
565576

566-
**Recursive sub-parsing**`NestedResolver` creates fresh `Parser` instances for CTE/subquery bodies. This reuses the entire pipeline recursively, with caching to avoid re-parsing the same body twice.
577+
**Recursive sub-parsing via dependency injection**`NestedResolver` creates fresh `Parser` instances for CTE/subquery bodies using a `parser_factory` callable injected by `Parser._get_resolver()`. This reuses the entire pipeline recursively (with caching to avoid re-parsing the same body twice) without introducing a module-level import cycle.

