**Regex fallbacks** — when `sqlglot.parse()` fails (raises `ValueError`), the parser falls back to regex extraction for columns (`_extract_columns_regex`) and LIMIT/OFFSET (`_extract_limit_regex`) rather than raising an error.
+
**Regex fallbacks** — when `sqlglot.parse()` fails (raises `InvalidQueryDefinition`), the parser falls back to regex extraction for columns (`_extract_columns_regex`) and LIMIT/OFFSET (`_extract_limit_regex`) rather than propagating the error. `InvalidQueryDefinition` is a `ValueError` subclass defined in [`exceptions.py`](sql_metadata/exceptions.py) — catching `ValueError` still works for external callers.
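The backward-compatibility claim can be checked with a minimal, self-contained sketch. The class below is a local stand-in that mirrors the description (a `ValueError` subclass) rather than an import from `sql_metadata`, and `parse_or_raise` and `caught_as` are hypothetical helpers used only to trigger and observe the exception.

```python
class InvalidQueryDefinition(ValueError):
    """Local stand-in mirroring the class described above."""


def parse_or_raise(sql: str) -> str:
    """Hypothetical helper: reject empty input, otherwise return the trimmed SQL."""
    if not sql.strip():
        raise InvalidQueryDefinition("Empty queries are not supported!")
    return sql.strip()


def caught_as(sql: str) -> str:
    """Report which exception type a legacy ``except ValueError`` handler sees."""
    try:
        parse_or_raise(sql)
    except ValueError as exc:  # handler written before the subclass existed
        return type(exc).__name__
    return "no error"
```

Callers that want to distinguish invalid SQL from other `ValueError` sources can catch the subclass first and fall through to the broader handler.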
---
@@ -204,7 +205,7 @@ flowchart TD
1. Parse with `sqlglot.parse()` (warnings suppressed)
2. Check for degradation via `_is_degraded` — phantom tables (`IGNORE`, `""`), keyword-as-column names (`UNIQUE`, `DISTINCT`)
3. If degraded and not the last dialect, try the next one
-
4. If all fail, raise `ValueError("This query is wrong")`
+
4. If all fail, raise `InvalidQueryDefinition` (a `ValueError` subclass from [`exceptions.py`](sql_metadata/exceptions.py))
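The four steps above can be sketched as a small loop. This is an illustration under stated assumptions, not the library's implementation: `parse` and `is_degraded` are injected callables standing in for `sqlglot.parse()` and `_is_degraded`, the exception class is a local stand-in, the error message is a placeholder, and treating a hard parse failure as "try the next dialect" is an assumption.

```python
from typing import Any, Callable, Sequence


class InvalidQueryDefinition(ValueError):
    """Local stand-in for the exception named in step 4."""


def parse_with_fallback(
    sql: str,
    dialects: Sequence[str],
    parse: Callable[[str, str], Any],
    is_degraded: Callable[[Any], bool],
) -> Any:
    """Try each dialect in order and return the first acceptable AST."""
    last_index = len(dialects) - 1
    for index, dialect in enumerate(dialects):
        try:
            ast = parse(sql, dialect)  # step 1
        except ValueError:
            continue  # assumption: a hard failure also moves on to the next dialect
        if is_degraded(ast) and index != last_index:  # steps 2 and 3
            continue
        return ast  # a clean result, or the last dialect's best effort
    # step 4: every dialect failed outright (placeholder message)
    raise InvalidQueryDefinition("no dialect could parse the query")
```

Note that a degraded result from the last dialect is still returned, which matches the "not the last dialect" wording in step 3.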
---
@@ -237,22 +238,24 @@ flowchart TB
#### DFS dispatch
-
The walk visits each node and dispatches to specialised handlers:
+
The walk visits each node and routes it through `_dispatch_leaf`, which calls a specialised handler or inline branch depending on the node type:
-
| AST Node Type |Handler| What it does|
+
| AST Node Type |Routing| What happens|
|---------------|---------|-------------|
-
|`exp.Star`|`_handle_star`| Adds `*` (skips if inside function like `COUNT(*)`) |
-
|`exp.ColumnDef`|(inline)| Adds column name for CREATE TABLE DDL |
-
|`exp.Identifier`|`_handle_identifier`| Adds column if in JOIN USING context |
+
|`exp.Star`|inline in `_dispatch_leaf`| Adds `*` (skips if inside a function like `COUNT(*)`) |
+
|`exp.ColumnDef`| inline in `_dispatch_leaf`| Adds column name for CREATE TABLE DDL |
+
|`exp.Identifier`|inline in `_dispatch_leaf`| Adds column if in JOIN USING context |
|`exp.CTE`|`_handle_cte`| Records CTE name, processes column definitions |
|`exp.Column`|`_handle_column`| Main handler — resolves table alias, builds full name |
-
|`exp.Subquery` (aliased) |(inline)| Records subquery name and depth for ordering |
+
|`exp.Subquery` (aliased) | inline in `_dispatch_leaf`| Records subquery name and depth for ordering |
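The routing in the table can be illustrated with a toy dispatcher. The node classes below are simplified stand-ins rather than sqlglot's `exp.*` types, the `Collector` class and its `walk` method are hypothetical, and only three of the rows are modelled; the point is the shape of `_dispatch_leaf` (inline branches for simple nodes, a dedicated handler for `Column`).

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Star:
    inside_function: bool = False  # True for the `*` in COUNT(*)


@dataclass
class ColumnDef:
    name: str


@dataclass
class Column:
    table: str
    name: str


class Collector:
    """Hypothetical extractor that gathers column names during a walk."""

    def __init__(self) -> None:
        self.columns: List[str] = []

    def _handle_column(self, node: Column) -> None:
        # Dedicated handler: build the qualified name when a table is known.
        full = f"{node.table}.{node.name}" if node.table else node.name
        self.columns.append(full)

    def _dispatch_leaf(self, node: object) -> None:
        if isinstance(node, Star):  # inline branch
            if not node.inside_function:
                self.columns.append("*")
        elif isinstance(node, ColumnDef):  # inline branch
            self.columns.append(node.name)
        elif isinstance(node, Column):  # routed to a handler
            self._handle_column(node)

    def walk(self, nodes: Iterable[object]) -> List[str]:
        for node in nodes:
            self._dispatch_leaf(node)
        return self.columns
```

Keeping the trivial cases inline avoids one-line handler methods, while the more involved `Column` logic stays in its own method.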
Walks the AST for `exp.Table` and `exp.Lateral` nodes, builds fully-qualified table names, and sorts results by first occurrence in the raw SQL.
+
Walks the AST for `exp.Table` nodes, builds fully-qualified table names, and sorts results by each table identifier's character position recorded by sqlglot's tokenizer.
- **Name construction** — `_table_full_name` assembles `catalog.db.name`, with special handling for bracket mode (TSQL) and double-dot notation (`catalog..name`)
-
- **Position sorting** — `_first_position` finds each table name in the raw SQL via regex, preferring matches after table-introducing keywords (`FROM`, `JOIN`, `TABLE`, `INTO`, `UPDATE`). This ensures output order matches left-to-right reading order.
-
- **CTE filtering** — table names matching known CTE names are excluded, so only real tables appear in the output
+
- **Name construction** — `_table_full_name` assembles `catalog.db.name`, with special handling for bracket mode (TSQL, via `_bracketed_full_name`) and double-dot notation (`catalog..name`, detected by `db == ""` in the AST).
+
- **Position sorting** — `_table_start_position` reads each table identifier's character offset from sqlglot's tokenizer (`Identifier.meta['start']`). No regex scan of the raw SQL is needed — the AST already carries source positions.
+
- **CTE filtering** — table names matching known CTE names are excluded, so only real tables appear in the output.
+
- **CREATE target placement** — for `CREATE TABLE ... AS SELECT` statements, the target table is extracted via `_extract_create_target` and prepended to the result regardless of its source position.
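The ordering rules in this list can be summarised in a short sketch. It assumes the character offsets have already been read from the AST and are passed in as a plain dictionary; `order_tables` and its parameters are hypothetical names for illustration, not functions from the codebase.

```python
from typing import Dict, Iterable, List, Optional


def order_tables(
    start_positions: Dict[str, int],
    cte_names: Iterable[str] = (),
    create_target: Optional[str] = None,
) -> List[str]:
    """Order table names by source offset, applying the rules listed above."""
    ctes = set(cte_names)
    # CTE filtering: names that refer to CTEs are not real tables.
    real_tables = [name for name in start_positions if name not in ctes]
    # Position sorting: the earliest character offset comes first.
    ordered = sorted(real_tables, key=lambda name: start_positions[name])
    # CREATE target placement: the target leads, whatever its offset.
    if create_target is not None:
        ordered = [create_target] + [n for n in ordered if n != create_target]
    return ordered
```

For example, `order_tables({"src": 10, "dst": 40}, create_target="dst")` puts `dst` first even though it appears later in the offsets.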
-
**Alias extraction** — `extract_aliases` walks `exp.Table` nodes looking for aliases:
+
**Alias extraction** — `extract_aliases(tables)` walks the cached `exp.Table` nodes looking for aliases, keeping only those whose fully-qualified name appears in *tables*:
```sql
SELECT * FROM users u JOIN orders o ON u.id = o.user_id
@@ -324,30 +326,30 @@ SELECT * FROM users u JOIN orders o ON u.id = o.user_id
Handles the complete "look inside nested queries" concern. Created lazily by `Parser._get_resolver()`.
+
Handles the complete "look inside nested queries" concern. Created lazily by `Parser._get_resolver()`, which passes the `Parser` class itself as a `parser_factory` callable (dependency injection) so the resolver can instantiate sub-parsers without importing `Parser` at module load time.
#### Four responsibilities
**1. Name extraction** — extract CTE and subquery names from the AST:
-
- `extract_cte_names(ast, cte_name_map)` — static method, walks `exp.CTE` nodes and collects their aliases (with reverse CTE name map applied)
-
- `extract_subquery_names(ast)` — static method, post-order walk collecting aliased `exp.Subquery` names
+
- `extract_cte_names(cte_name_map)` — instance method, walks `exp.CTE` nodes and collects their aliases (with the reverse CTE name map applied to restore dots that `SqlCleaner` replaced with `__DOT__`).
+
- `extract_subqueries(ast)` — static method, single post-order walk that returns `(names, bodies)` together. Innermost subqueries appear first. Aliased subqueries keep their alias; unaliased ones get synthetic `subquery_N` names.
Called directly by `Parser.with_names` and `Parser.subqueries_names`.
**2. Body extraction** — render CTE/subquery AST nodes back to SQL:
-
- `extract_cte_bodies` — finds `exp.CTE` nodes in the AST, renders their body via `_PreservingGenerator`
-
- `extract_subquery_bodies` — post-order walk so inner subqueries appear before outer ones
-
- `_PreservingGenerator` — custom sqlglot `Generator` that preserves function signatures sqlglot would normalise (e.g., keeps `IFNULL` instead of converting to `COALESCE`, keeps `DIV` instead of `CAST(... / ... AS INT)`)
+
- `extract_cte_bodies(cte_name_map)` — finds `exp.CTE` nodes in the AST and renders each body via `_PreservingGenerator`.
+
- Subquery bodies are produced alongside their names by `extract_subqueries`, so there is no separate body-extraction method.
+
- `_PreservingGenerator` — custom sqlglot `Generator` that preserves function signatures sqlglot would normalise: keeps `IFNULL` instead of rewriting to `COALESCE`, keeps `DIV` instead of `CAST(... / ... AS INT)`, renders `DATE_ADD`/`DATE_SUB`, and preserves `IS NOT NULL` / `NOT IN` idioms.
**3. Column resolution** — `resolve()` runs two phases:
```mermaid
flowchart TB
INPUT["columns from ColumnExtractor"]
INPUT --> P1["Phase 1: _resolve_sub_queries()\nReplace subquery.column refs\nwith actual columns"]
-
P1 --> P2["Phase 2: _resolve_bare_through_nested()\nDrop bare names that are\naliases in nested queries"]
+
P1 --> P2["Phase 2: _resolve_unqualified_through_nested()\nDrop bare names that are\naliases in nested queries"]
P2 --> OUTPUT["Resolved columns"]
```
@@ -364,11 +366,11 @@ SELECT label FROM cte
-- "label" is an alias inside the CTE → dropped from columns, added to aliases
```
-
**4. Recursive sub-Parser instantiation** — when resolving `subquery.column`, the resolver creates a new `Parser(body_sql)` for each nested query body (cached in `_subqueries_parsers` / `_with_parsers`). This means the full pipeline runs recursively for each CTE/subquery.
+
**4. Recursive sub-Parser instantiation** — when resolving `subquery.column`, the resolver invokes `self._parser_factory(body_sql)` to build a new `Parser` for each nested body (cached in `_subqueries_parsers` / `_with_parsers`). The full pipeline runs recursively for each CTE/subquery, but the dependency is injected rather than imported.
#### Alias resolution with cycle detection
-
`_resolve_column_alias` follows alias chains with a `visited` set to prevent infinite loops:
+
`resolve_column_alias` (public) and its private helper `_resolve_column_alias` follow alias chains with a `visited` set to prevent infinite loops:
```python
# a → b → c (resolves to "c")
@@ -396,9 +398,10 @@ Maps the AST root node type to a `QueryType` enum value via `_SIMPLE_TYPE_MAP`:
- `exp.Command` → `_resolve_command_type` checks for `CREATE FUNCTION` / `ALTER`
-
- `REPLACE INTO` → detected via `ASTParser.is_replace` flag, patched in `Parser.query_type`
+
- A bare `exp.With` root (a `WITH` clause with no main statement) raises `InvalidQueryDefinition` — it is not valid SQL on its own.
+
- `exp.Command` → `_resolve_command_type` inspects the command's `this` attribute and maps `CREATE` back to `QueryType.CREATE` so dialect-specific DDL that degrades to an opaque command still returns a useful type.
+
- `REPLACE INTO` → `Parser` forwards the `ASTParser.is_replace` flag into the extractor's constructor; when the AST is `exp.Insert` and `is_replace` is true, the extractor returns `QueryType.REPLACE` directly.
+
- Empty / comment-only SQL → `_raise_for_none_ast` distinguishes "no parseable content" (`"Empty queries are not supported!"`) from "had content but sqlglot produced no AST" (`"Could not parse the query — the SQL syntax appears to be invalid"`), both raised as `InvalidQueryDefinition`.
---
@@ -435,6 +438,9 @@ A collection of pure stateless functions (no class). Exploits the fact that sqlg
- `last_segment` — returns the last dot-separated segment of a qualified name (e.g. ``"schema.table.column"`` → ``"column"``).
- `DOT_PLACEHOLDER` — encoding constant for qualified CTE names (``__DOT__``).
- `InvalidQueryDefinition` — a `ValueError` subclass raised whenever the SQL is structurally invalid (empty, unparseable, unsupported query type, alias-less CTE, or all dialects degraded). Inheriting from `ValueError` keeps existing `except ValueError:` handlers working while giving callers a specific type to catch.
+
**[`generalizator.py`](sql_metadata/generalizator.py)** — anonymises SQL for log aggregation: strips comments, replaces literals with `X`, numbers with `N`, collapses `IN(...)` lists to `(XYZ)`.
---
@@ -514,6 +520,7 @@ sequenceDiagram
flowchart TB
INIT["__init__.py"]
INIT --> P["parser.py"]
+
INIT --> EXC["exceptions.py"]
P --> AST["ast_parser.py"]
P --> EXT["column_extractor.py"]
@@ -529,25 +536,29 @@ flowchart TB
AST --> DP["dialect_parser.py"]
SC --> COM
+
SC --> EXC
DP --> COM
+
DP --> EXC
DP -.->|"sqlglot.parse()"| SG["sqlglot"]
TAB --> DP
EXT -.-> SG
EXT --> UT
+
EXT --> EXC
TAB -.-> SG
RES -.-> SG
RES --> UT
-
RES -->|"sub-Parser\n(recursive)"| P
+
RES -.->|"parser_factory\n(injected by Parser)"| P
QT -.-> SG
QT --> KW
+
QT --> EXC
COM -.->|"Tokenizer"| SG
GEN --> COM
style SG fill:#f0f0f0,stroke:#999
```
-
Note the circular dependency: `nested_resolver.py` imports `Parser` from `parser.py` to create sub-Parser instances for nested queries. This import is deferred (inside method bodies) to avoid import-time cycles.
+
`nested_resolver.py` needs `Parser` to recursively analyse CTE/subquery bodies, but importing `Parser` at module load would create a cycle (`parser.py` already imports `NestedResolver`). Instead, `Parser._get_resolver()` passes the `Parser` class itself into `NestedResolver.__init__` as a `parser_factory` callable — pure dependency injection. The only `parser.py` reference in `nested_resolver.py` is a `TYPE_CHECKING`-guarded import for type hints.
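The injection pattern described here can be sketched in a single file. Both classes are simplified stand-ins: the `bodies` dictionary, the `parser_for` method, and the constructor signatures are assumptions made for illustration, and the real classes do far more. What the sketch preserves is that the resolver never names `Parser` at runtime; it only calls whatever factory it was given, and it caches the result per body.

```python
from typing import Callable, Dict, Optional


class NestedResolver:
    """Stand-in resolver: receives a factory instead of importing Parser."""

    def __init__(
        self,
        bodies: Dict[str, str],
        parser_factory: Callable[[str], "Parser"],
    ) -> None:
        self._bodies = bodies  # hypothetical: CTE/subquery name -> body SQL
        self._parser_factory = parser_factory
        self._parsers: Dict[str, "Parser"] = {}  # one cached sub-parser per body

    def parser_for(self, name: str) -> "Parser":
        """Build the sub-parser for a nested body on first use, then reuse it."""
        if name not in self._parsers:
            self._parsers[name] = self._parser_factory(self._bodies[name])
        return self._parsers[name]


class Parser:
    """Stand-in parser that injects its own class into the resolver."""

    def __init__(self, sql: str, bodies: Optional[Dict[str, str]] = None) -> None:
        self.sql = sql
        self._bodies = bodies or {}
        self._resolver: Optional[NestedResolver] = None

    def _get_resolver(self) -> NestedResolver:
        if self._resolver is None:  # created lazily, as described above
            self._resolver = NestedResolver(self._bodies, type(self))
        return self._resolver
```

Because the factory is just a callable, a test can pass a stub in place of `Parser` to exercise the resolver in isolation.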
---
@@ -563,4 +574,4 @@ Note the circular dependency: `nested_resolver.py` imports `Parser` from `parser
**Graceful regex fallbacks** — when the AST parse fails entirely, the parser degrades to regex-based extraction for columns (INSERT INTO pattern) and LIMIT/OFFSET rather than raising an error.
-
**Recursive sub-parsing** — `NestedResolver` creates fresh `Parser` instances for CTE/subquery bodies. This reuses the entire pipeline recursively, with caching to avoid re-parsing the same body twice.
+
**Recursive sub-parsing via dependency injection** — `NestedResolver` creates fresh `Parser` instances for CTE/subquery bodies using a `parser_factory` callable injected by `Parser._get_resolver()`. This reuses the entire pipeline recursively (with caching to avoid re-parsing the same body twice) without introducing a module-level import cycle.