
Commit 286d8ec

update docs and readme, additional cleanup when possible -> use native sqlglot features whenever possible
1 parent 86ed996 commit 286d8ec

8 files changed

Lines changed: 285 additions & 288 deletions

File tree

AGENTS.md

Lines changed: 30 additions & 0 deletions
@@ -140,6 +140,36 @@ poetry run ruff check sql_metadata # linting
 - Enabled rule sets: E, F, W (pycodestyle/pyflakes), C90 (mccabe), I (isort)
 - Exceptions: Use `# noqa: C901` for complex but necessary functions

+## Review Practices
+
+### Verify before grading severity
+
+When reviewing code (or producing a critical review of a branch/PR), **spike
+every claim before attaching a severity or a "~N LoC removable" number**:
+
+1. Read the tests that cover the code path you're flagging — they encode
+the actual contract you'd be changing.
+2. If the claim is "library X already handles this", actually run X against
+a handful of real inputs from the codebase and confirm the output shape
+matches what downstream code consumes.
+3. If the claim is "N lines removable", sketch the replacement and see
+whether tests still pass — mentally or via a throwaway branch.
+4. Only verified claims deserve HIGH severity or concrete LoC numbers.
+Unverified hunches belong in a "needs investigation" list, not a
+severity-ranked review.
+
+Estimates without verification give false authority to findings that may
+not hold up. In a past v3 review four HIGH/MEDIUM items (comment
+extraction, scope-based resolution, LIMIT regex, "god class" LoC grade)
+dissolved within minutes of actual investigation; all four would have
+been caught by a pre-grade spike.
+
+Before closing a review phase, re-read every HIGH and MEDIUM finding and
+confirm a verification step exists in the session transcript for each
+one. If a spike did not happen, downgrade or drop the finding before
+publishing. Codifying the rule in memory is not enough — it has to be
+applied *before* the claim is formed, not consulted afterwards.
+
 ## Error Handling Patterns

 ### Malformed SQL Detection

ARCHITECTURE.md

Lines changed: 46 additions & 35 deletions
@@ -17,6 +17,7 @@ sql-metadata v3 is a Python library that parses SQL queries and extracts metadat
 | [`comments.py`](sql_metadata/comments.py) | Comment extraction/stripping via tokenizer gaps | `extract_comments`, `strip_comments` |
 | [`keywords_lists.py`](sql_metadata/keywords_lists.py) | `QueryType` enum ||
 | [`utils.py`](sql_metadata/utils.py) | `UniqueList` (deduplicating list), `last_segment`, `DOT_PLACEHOLDER` ||
+| [`exceptions.py`](sql_metadata/exceptions.py) | Custom exception hierarchy | `InvalidQueryDefinition` |
 | [`generalizator.py`](sql_metadata/generalizator.py) | Query anonymisation for log aggregation | `Generalizator` |

 ---
@@ -126,7 +127,7 @@ def tables(self) -> List[str]:
     return self._tables
 ```

-**Regex fallbacks** — when `sqlglot.parse()` fails (raises `ValueError`), the parser falls back to regex extraction for columns (`_extract_columns_regex`) and LIMIT/OFFSET (`_extract_limit_regex`) rather than raising an error.
+**Regex fallbacks** — when `sqlglot.parse()` fails (raises `InvalidQueryDefinition`), the parser falls back to regex extraction for columns (`_extract_columns_regex`) and LIMIT/OFFSET (`_extract_limit_regex`) rather than propagating the error. `InvalidQueryDefinition` is a `ValueError` subclass defined in [`exceptions.py`](sql_metadata/exceptions.py) — catching `ValueError` still works for external callers.

 ---

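As an editorial aside, the subclass-plus-fallback pattern described in that hunk can be sketched in isolation. This is an illustration, not the library's code — `parse_limit`, `limit_with_fallback`, and the regex are invented for the example; only the `ValueError`-subclass trick is taken from the text above.

```python
import re

class InvalidQueryDefinition(ValueError):
    """Domain-specific error that still satisfies `except ValueError:`."""

def parse_limit(sql):
    # Stand-in for the AST parse: pretend it always rejects the query.
    raise InvalidQueryDefinition("Could not parse the query")

def limit_with_fallback(sql):
    try:
        return parse_limit(sql)
    except InvalidQueryDefinition:
        # Fall back to a regex scan instead of propagating the error,
        # mirroring the `_extract_limit_regex` behaviour described above.
        match = re.search(r"\bLIMIT\s+(\d+)", sql, re.IGNORECASE)
        return int(match.group(1)) if match else None

print(limit_with_fallback("SELECT * FROM t LIMIT 10"))  # → 10
```

Because the exception inherits from `ValueError`, a caller written against the old API (`except ValueError:`) keeps working unchanged.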
@@ -204,7 +205,7 @@ flowchart TD
 1. Parse with `sqlglot.parse()` (warnings suppressed)
 2. Check for degradation via `_is_degraded` — phantom tables (`IGNORE`, `""`), keyword-as-column names (`UNIQUE`, `DISTINCT`)
 3. If degraded and not the last dialect, try the next one
-4. If all fail, raise `ValueError("This query is wrong")`
+4. If all fail, raise `InvalidQueryDefinition` (a `ValueError` subclass from [`exceptions.py`](sql_metadata/exceptions.py))

 ---

@@ -237,22 +238,24 @@ flowchart TB

 #### DFS dispatch

-The walk visits each node and dispatches to specialised handlers:
+The walk visits each node and routes it through `_dispatch_leaf`, which calls a specialised handler or inline branch depending on the node type:

-| AST Node Type | Handler | What it does |
+| AST Node Type | Routing | What happens |
 |---------------|---------|-------------|
-| `exp.Star` | `_handle_star` | Adds `*` (skips if inside function like `COUNT(*)`) |
-| `exp.ColumnDef` | (inline) | Adds column name for CREATE TABLE DDL |
-| `exp.Identifier` | `_handle_identifier` | Adds column if in JOIN USING context |
+| `exp.Star` | inline in `_dispatch_leaf` | Adds `*` (skips if inside a function like `COUNT(*)`) |
+| `exp.ColumnDef` | inline in `_dispatch_leaf` | Adds column name for CREATE TABLE DDL |
+| `exp.Identifier` | inline in `_dispatch_leaf` | Adds column if in JOIN USING context |
 | `exp.CTE` | `_handle_cte` | Records CTE name, processes column definitions |
 | `exp.Column` | `_handle_column` | Main handler — resolves table alias, builds full name |
-| `exp.Subquery` (aliased) | (inline) | Records subquery name and depth for ordering |
+| `exp.Subquery` (aliased) | inline in `_dispatch_leaf` | Records subquery name and depth for ordering |

 **Special processing** in `_process_child_key`:
 - SELECT expressions → `_handle_select_exprs` → iterates expressions, detects aliases
 - INSERT schema → `_handle_insert_schema` → extracts column list from `INSERT INTO t(col1, col2)`
 - JOIN USING → `_handle_join_using` → extracts column identifiers

+**Error handling** — `_handle_cte` raises `InvalidQueryDefinition` if a `WITH` clause contains an alias-less CTE (invalid SQL).
+
 #### Clause classification

 `_classify_clause` maps each `arg_types` key to a `columns_dict` section:
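The `_dispatch_leaf`-style routing in the table above reduces to dispatch on node type. A toy sketch (editorial — the node classes here are invented stand-ins for sqlglot's `exp.*` types, not its real API):

```python
from dataclasses import dataclass

class Star:            # stands in for exp.Star
    pass

class Identifier:      # stands in for exp.Identifier (JOIN USING context)
    def __init__(self, name):
        self.name = name

@dataclass
class Column:          # stands in for exp.Column
    name: str
    table: str = ""

def dispatch_leaf(node, columns):
    # Route each visited node by type, mirroring the table above:
    # some node types get an inline branch, others a dedicated handler.
    if isinstance(node, Star):
        columns.append("*")
    elif isinstance(node, Column):
        # "Main handler" case: build the table-qualified name if possible.
        columns.append(f"{node.table}.{node.name}" if node.table else node.name)
    elif isinstance(node, Identifier):
        columns.append(node.name)

cols = []
for node in (Column("id", "u"), Star(), Identifier("user_id")):
    dispatch_leaf(node, cols)
print(cols)  # → ['u.id', '*', 'user_id']
```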
@@ -285,31 +288,30 @@ The walk visits each node and dispatches to specialised handlers:

 **File:** [`table_extractor.py`](sql_metadata/table_extractor.py) | **Class:** `TableExtractor`

-Walks the AST for `exp.Table` and `exp.Lateral` nodes, builds fully-qualified table names, and sorts results by first occurrence in the raw SQL.
+Walks the AST for `exp.Table` nodes, builds fully-qualified table names, and sorts results by each table identifier's character position recorded by sqlglot's tokenizer.

 #### Extraction flow

 ```mermaid
 flowchart TB
-AST["sqlglot AST"] --> CHECK{"exp.Command?"}
-CHECK -->|Yes| REGEX["Regex fallback\n(_extract_tables_from_command)"]
-CHECK -->|No| CREATE{"exp.Create?"}
-CREATE -->|Yes| TARGET["Extract CREATE target"]
+AST["sqlglot AST"] --> CREATE{"exp.Create?"}
+CREATE -->|Yes| TARGET["_extract_create_target()\nTarget goes first"]
 CREATE -->|No| SKIP["skip"]
 TARGET --> COLLECT
-SKIP --> COLLECT["_collect_all()\nWalk exp.Table + exp.Lateral"]
+SKIP --> COLLECT["_table_nodes()\nfind_all(exp.Table), cached"]
 COLLECT --> FILTER["Filter out CTE names"]
-FILTER --> SORT["Sort by _first_position()\n(regex in raw SQL)"]
-SORT --> ORDER["_place_tables_in_order()\nCREATE target goes first"]
+FILTER --> SORT["Sort by _table_start_position()\n(Identifier.meta['start'])"]
+SORT --> FINAL["UniqueList:\nCREATE target + sorted tables"]
 ```

 **Key algorithms:**

-- **Name construction** — `_table_full_name` assembles `catalog.db.name`, with special handling for bracket mode (TSQL) and double-dot notation (`catalog..name`)
-- **Position sorting** — `_first_position` finds each table name in the raw SQL via regex, preferring matches after table-introducing keywords (`FROM`, `JOIN`, `TABLE`, `INTO`, `UPDATE`). This ensures output order matches left-to-right reading order.
-- **CTE filtering** — table names matching known CTE names are excluded, so only real tables appear in the output
+- **Name construction** — `_table_full_name` assembles `catalog.db.name`, with special handling for bracket mode (TSQL, via `_bracketed_full_name`) and double-dot notation (`catalog..name`, detected by `db == ""` in the AST).
+- **Position sorting** — `_table_start_position` reads each table identifier's character offset from sqlglot's tokenizer (`Identifier.meta['start']`). No regex scan of the raw SQL is needed — the AST already carries source positions.
+- **CTE filtering** — table names matching known CTE names are excluded, so only real tables appear in the output.
+- **CREATE target placement** — for `CREATE TABLE ... AS SELECT` statements, the target table is extracted via `_extract_create_target` and prepended to the result regardless of its source position.
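The ordering logic those bullets describe can be illustrated with a small sketch (editorial — the integer offsets stand in for the `Identifier.meta['start']` values mentioned above, and `order_tables` is an invented helper name):

```python
def order_tables(create_target, tables_with_offsets):
    """Sort tables by source offset, dedupe, and put the CREATE target first."""
    ordered = sorted(tables_with_offsets, key=lambda pair: pair[1])
    result = [create_target] if create_target else []
    for name, _offset in ordered:
        if name not in result:   # UniqueList-style dedup, order preserved
            result.append(name)
    return result

# CREATE TABLE report AS SELECT ... FROM orders JOIN users ...
print(order_tables("report", [("users", 52), ("orders", 40)]))
# → ['report', 'orders', 'users']
```

The CREATE target is prepended unconditionally, so it wins even when its identifier appears later in the raw SQL than the source tables.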

-**Alias extraction** — `extract_aliases` walks `exp.Table` nodes looking for aliases:
+**Alias extraction** — `extract_aliases(tables)` walks the cached `exp.Table` nodes looking for aliases, keeping only those whose fully-qualified name appears in *tables*:

 ```sql
 SELECT * FROM users u JOIN orders o ON u.id = o.user_id
@@ -324,30 +326,30 @@ SELECT * FROM users u JOIN orders o ON u.id = o.user_id

 **File:** [`nested_resolver.py`](sql_metadata/nested_resolver.py) | **Class:** `NestedResolver`

-Handles the complete "look inside nested queries" concern. Created lazily by `Parser._get_resolver()`.
+Handles the complete "look inside nested queries" concern. Created lazily by `Parser._get_resolver()`, which passes the `Parser` class itself as a `parser_factory` callable (dependency injection) so the resolver can instantiate sub-parsers without importing `Parser` at module load time.

 #### Four responsibilities

 **1. Name extraction** — extract CTE and subquery names from the AST:

-- `extract_cte_names(ast, cte_name_map)` — static method, walks `exp.CTE` nodes and collects their aliases (with reverse CTE name map applied)
-- `extract_subquery_names(ast)` — static method, post-order walk collecting aliased `exp.Subquery` names
+- `extract_cte_names(cte_name_map)` — instance method, walks `exp.CTE` nodes and collects their aliases (with the reverse CTE name map applied to restore dots that `SqlCleaner` replaced with `__DOT__`).
+- `extract_subqueries(ast)` — static method, single post-order walk that returns `(names, bodies)` together. Innermost subqueries appear first. Aliased subqueries keep their alias; unaliased ones get synthetic `subquery_N` names.

 Called directly by `Parser.with_names` and `Parser.subqueries_names`.

 **2. Body extraction** — render CTE/subquery AST nodes back to SQL:

-- `extract_cte_bodies` — finds `exp.CTE` nodes in the AST, renders their body via `_PreservingGenerator`
-- `extract_subquery_bodies` — post-order walk so inner subqueries appear before outer ones
-- `_PreservingGenerator` — custom sqlglot `Generator` that preserves function signatures sqlglot would normalise (e.g., keeps `IFNULL` instead of converting to `COALESCE`, keeps `DIV` instead of `CAST(... / ... AS INT)`)
+- `extract_cte_bodies(cte_name_map)` — finds `exp.CTE` nodes in the AST and renders each body via `_PreservingGenerator`.
+- Subquery bodies are produced alongside their names by `extract_subqueries` — no separate body-extraction method.
+- `_PreservingGenerator` — custom sqlglot `Generator` that preserves function signatures sqlglot would normalise: keeps `IFNULL` instead of rewriting to `COALESCE`, keeps `DIV` instead of `CAST(... / ... AS INT)`, renders `DATE_ADD`/`DATE_SUB`, and preserves `IS NOT NULL` / `NOT IN` idioms.

 **3. Column resolution** — `resolve()` runs two phases:

 ```mermaid
 flowchart TB
 INPUT["columns from ColumnExtractor"]
 INPUT --> P1["Phase 1: _resolve_sub_queries()\nReplace subquery.column refs\nwith actual columns"]
-P1 --> P2["Phase 2: _resolve_bare_through_nested()\nDrop bare names that are\naliases in nested queries"]
+P1 --> P2["Phase 2: _resolve_unqualified_through_nested()\nDrop bare names that are\naliases in nested queries"]
 P2 --> OUTPUT["Resolved columns"]
 ```

@@ -364,11 +366,11 @@ SELECT label FROM cte
 -- "label" is an alias inside the CTE → dropped from columns, added to aliases
 ```

-**4. Recursive sub-Parser instantiation** — when resolving `subquery.column`, the resolver creates a new `Parser(body_sql)` for each nested query body (cached in `_subqueries_parsers` / `_with_parsers`). This means the full pipeline runs recursively for each CTE/subquery.
+**4. Recursive sub-Parser instantiation** — when resolving `subquery.column`, the resolver invokes `self._parser_factory(body_sql)` to build a new `Parser` for each nested body (cached in `_subqueries_parsers` / `_with_parsers`). The full pipeline runs recursively for each CTE/subquery, but the dependency is injected rather than imported.

 #### Alias resolution with cycle detection

-`_resolve_column_alias` follows alias chains with a `visited` set to prevent infinite loops:
+`resolve_column_alias` (public) and its private helper `_resolve_column_alias` follow alias chains with a `visited` set to prevent infinite loops:
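A cycle-safe alias-chain walk of that shape might look like the following sketch (editorial — it assumes a plain `dict` of alias→target mappings rather than the library's internal state):

```python
def resolve_column_alias(alias, alias_map):
    """Follow alias -> target links; the visited set breaks cycles."""
    visited = set()
    current = alias
    while current in alias_map and current not in visited:
        visited.add(current)
        current = alias_map[current]
    return current

print(resolve_column_alias("a", {"a": "b", "b": "c"}))  # → c
# A pathological chain a -> b -> a terminates instead of spinning forever:
print(resolve_column_alias("a", {"a": "b", "b": "a"}))  # → a
```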

 ```python
 # a → b → c (resolves to "c")
@@ -396,9 +398,10 @@ Maps the AST root node type to a `QueryType` enum value via `_SIMPLE_TYPE_MAP`:
 | `exp.Merge` | `MERGE` |

 Special handling:
-- Parenthesised queries → `_unwrap_parens` strips `Paren`/`Subquery` wrappers
-- `exp.Command` — `_resolve_command_type` checks for `CREATE FUNCTION` / `ALTER`
-- `REPLACE INTO` → detected via `ASTParser.is_replace` flag, patched in `Parser.query_type`
+- A bare `exp.With` root (a `WITH` clause with no main statement) raises `InvalidQueryDefinition` — it is not valid SQL on its own.
+- `exp.Command` — `_resolve_command_type` inspects the command's `this` attribute and maps `CREATE` back to `QueryType.CREATE` so dialect-specific DDL that degrades to an opaque command still returns a useful type.
+- `REPLACE INTO` — `Parser` forwards the `ASTParser.is_replace` flag into the extractor's constructor; when the AST is `exp.Insert` and `is_replace` is true, the extractor returns `QueryType.REPLACE` directly.
+- Empty / comment-only SQL → `_raise_for_none_ast` distinguishes "no parseable content" (`"Empty queries are not supported!"`) from "had content but sqlglot produced no AST" (`"Could not parse the query — the SQL syntax appears to be invalid"`), both raised as `InvalidQueryDefinition`.

 ---
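A `_SIMPLE_TYPE_MAP`-style lookup with an `is_replace` override can be sketched as follows (editorial — the `Select`/`Insert`/`Merge` classes are invented stand-ins for sqlglot expression types, and the enum is trimmed to a few members):

```python
from enum import Enum

class QueryType(Enum):
    SELECT = "SELECT"
    INSERT = "INSERT"
    MERGE = "MERGE"
    REPLACE = "REPLACE"

class Select: pass   # stand-in for exp.Select
class Insert: pass   # stand-in for exp.Insert
class Merge: pass    # stand-in for exp.Merge

# Root node type -> query type, mirroring the table above.
_SIMPLE_TYPE_MAP = {
    Select: QueryType.SELECT,
    Insert: QueryType.INSERT,
    Merge: QueryType.MERGE,
}

def query_type(root, is_replace=False):
    # REPLACE INTO parses as an Insert node; the forwarded flag flips the answer.
    if isinstance(root, Insert) and is_replace:
        return QueryType.REPLACE
    return _SIMPLE_TYPE_MAP[type(root)]

print(query_type(Merge()).value)         # → MERGE
print(query_type(Insert(), True).value)  # → REPLACE
```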

@@ -435,6 +438,9 @@ A collection of pure stateless functions (no class). Exploits the fact that sqlg
 - `last_segment` — returns the last dot-separated segment of a qualified name (e.g. ``"schema.table.column"`` → ``"column"``).
 - `DOT_PLACEHOLDER` — encoding constant for qualified CTE names (``__DOT__``).
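The two helpers named above are small enough to sketch in full — an editorial reconstruction assuming only the semantics the bullets describe, not the module's actual implementation:

```python
class UniqueList(list):
    """A list that drops duplicate appends while preserving insertion order."""
    def append(self, item):
        if item not in self:
            super().append(item)

def last_segment(name):
    """Return the last dot-separated segment of a qualified name."""
    return name.rsplit(".", 1)[-1]

tables = UniqueList()
for t in ["users", "orders", "users"]:
    tables.append(t)
print(tables)                               # → ['users', 'orders']
print(last_segment("schema.table.column"))  # → column
```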

+**[`exceptions.py`](sql_metadata/exceptions.py):**
+- `InvalidQueryDefinition` — a `ValueError` subclass raised whenever the SQL is structurally invalid (empty, unparseable, unsupported query type, alias-less CTE, or all dialects degraded). Inheriting from `ValueError` keeps existing `except ValueError:` handlers working while giving callers a specific type to catch.
+
 **[`generalizator.py`](sql_metadata/generalizator.py)** — anonymises SQL for log aggregation: strips comments, replaces literals with `X`, numbers with `N`, collapses `IN(...)` lists to `(XYZ)`.

 ---
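The anonymisation rules listed for `generalizator.py` can be approximated with three regex passes — an editorial sketch of the idea, not the `Generalizator` class itself (comment stripping is omitted, and the patterns are deliberately naive):

```python
import re

def generalize(sql):
    # Collapse IN(...) lists first, before their numbers become N's.
    sql = re.sub(r"(?i)\bIN\s*\([^)]*\)", "IN (XYZ)", sql)
    sql = re.sub(r"'[^']*'", "X", sql)   # string literals -> X
    sql = re.sub(r"\b\d+\b", "N", sql)   # numbers -> N
    return sql

print(generalize("SELECT * FROM t WHERE id = 123 AND name = 'bob'"))
# → SELECT * FROM t WHERE id = N AND name = X
```

Queries that differ only in their literal values now collapse to one canonical string, which is what makes log aggregation work.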
@@ -514,6 +520,7 @@ sequenceDiagram
 flowchart TB
 INIT["__init__.py"]
 INIT --> P["parser.py"]
+INIT --> EXC["exceptions.py"]

 P --> AST["ast_parser.py"]
 P --> EXT["column_extractor.py"]
@@ -529,25 +536,29 @@ flowchart TB
 AST --> DP["dialect_parser.py"]

 SC --> COM
+SC --> EXC
 DP --> COM
+DP --> EXC
 DP -.->|"sqlglot.parse()"| SG["sqlglot"]
 TAB --> DP

 EXT -.-> SG
 EXT --> UT
+EXT --> EXC
 TAB -.-> SG
 RES -.-> SG
 RES --> UT
-RES -->|"sub-Parser\n(recursive)"| P
+RES -.->|"parser_factory\n(injected by Parser)"| P
 QT -.-> SG
 QT --> KW
+QT --> EXC
 COM -.->|"Tokenizer"| SG
 GEN --> COM

 style SG fill:#f0f0f0,stroke:#999
 ```

-Note the circular dependency: `nested_resolver.py` imports `Parser` from `parser.py` to create sub-Parser instances for nested queries. This import is deferred (inside method bodies) to avoid import-time cycles.
+`nested_resolver.py` needs `Parser` to recursively analyse CTE/subquery bodies, but importing `Parser` at module load would create a cycle (`parser.py` already imports `NestedResolver`). Instead, `Parser._get_resolver()` passes the `Parser` class itself into `NestedResolver.__init__` as a `parser_factory` callable — pure dependency injection. The only `parser.py` reference in `nested_resolver.py` is a `TYPE_CHECKING`-guarded import for type hints.
552563
---
553564

@@ -563,4 +574,4 @@ Note the circular dependency: `nested_resolver.py` imports `Parser` from `parser
563574

564575
**Graceful regex fallbacks** — when the AST parse fails entirely, the parser degrades to regex-based extraction for columns (INSERT INTO pattern) and LIMIT/OFFSET rather than raising an error.
565576

566-
**Recursive sub-parsing**`NestedResolver` creates fresh `Parser` instances for CTE/subquery bodies. This reuses the entire pipeline recursively, with caching to avoid re-parsing the same body twice.
577+
**Recursive sub-parsing via dependency injection**`NestedResolver` creates fresh `Parser` instances for CTE/subquery bodies using a `parser_factory` callable injected by `Parser._get_resolver()`. This reuses the entire pipeline recursively (with caching to avoid re-parsing the same body twice) without introducing a module-level import cycle.

