Skip to content

Latest commit

 

History

History
325 lines (260 loc) · 11.3 KB

File metadata and controls

325 lines (260 loc) · 11.3 KB

Agent Notes for sql-metadata Repository

This file contains important information about the sql-metadata repository for AI agents and developers working on the codebase.

Project Overview

sql-metadata is a Python library that parses SQL queries and extracts metadata such as:

  • Tables referenced in queries
  • Columns used
  • Query types (SELECT, INSERT, UPDATE, etc.)
  • WITH clause (CTE) definitions
  • Subqueries and aliases

Technology Stack:

  • Python 3.10+
  • sqlglot library for SQL parsing and AST construction
  • Poetry for dependency management
  • pytest for testing
  • ruff for linting and formatting

Repository Structure

sql-metadata/
├── sql_metadata/                  # Main package
│   ├── parser.py                 # Public facade — Parser class
│   ├── ast_parser.py             # ASTParser — thin orchestrator, composes SqlCleaner + DialectParser
│   ├── sql_cleaner.py            # SqlCleaner — raw SQL preprocessing (no sqlglot dependency)
│   ├── dialect_parser.py         # DialectParser — dialect detection, parsing, quality validation
│   ├── column_extractor.py       # ColumnExtractor — single-pass DFS column/alias extraction
│   ├── table_extractor.py        # TableExtractor — table extraction with position sorting
│   ├── nested_resolver.py        # NestedResolver — CTE/subquery names, bodies, resolution
│   ├── query_type_extractor.py   # QueryTypeExtractor — query type detection
│   ├── comments.py               # Comment extraction/stripping (pure functions)
│   ├── keywords_lists.py         # QueryType enum
│   ├── utils.py                  # UniqueList, last_segment, shared helpers
│   ├── generalizator.py          # Query anonymisation
│   └── __init__.py               # Exports: Parser, QueryType
├── test/                          # Test suite (25 test files)
│   ├── test_with_statements.py
│   ├── test_getting_tables.py
│   ├── test_getting_columns.py
│   └── ...
├── ARCHITECTURE.md               # Detailed architecture docs with Mermaid diagrams
├── pyproject.toml                # Poetry configuration
├── Makefile                      # Common commands
└── README.md

Architecture Overview

The v3 architecture uses sqlglot to build an AST, then walks it with specialised extractor classes composed by a thin Parser facade. See ARCHITECTURE.md for detailed module deep dives, traced walkthroughs, and Mermaid diagrams.

Pipeline

Raw SQL → SqlCleaner (preprocessing)
       → DialectParser (dialect detection, sqlglot.parse())
       → sqlglot AST (cached by ASTParser)
       → TableExtractor (tables, table aliases)
       → ColumnExtractor (columns, column aliases — single-pass DFS)
       → NestedResolver (CTE/subquery names + bodies, column resolution)
       → Final metadata (cached on Parser)

Key Design Patterns

  • Composition over inheritanceParser composes ASTParser, TableExtractor, ColumnExtractor, NestedResolver, QueryTypeExtractor
  • Lazy evaluation with caching — properties compute on first access, cache the result
  • Single-pass DFSColumnExtractor walks AST in arg_types key order (mirrors SQL text order)
  • Multi-dialect retryASTParser tries several sqlglot dialects, picks first non-degraded result
  • Graceful regex fallbacks — degrades to regex when sqlglot parse fails

Class Responsibilities

Class Owns Does NOT own
Parser Facade, caching, regex fallbacks, value extraction No extraction logic
ASTParser Orchestration, lazy AST caching No preprocessing, no parsing
SqlCleaner Raw SQL preprocessing (REPLACE rewrite, comment strip, CTE normalisation) No AST, no sqlglot
DialectParser Dialect detection, sqlglot parsing, parse-quality validation No preprocessing
ColumnExtractor Column names, column aliases (during DFS walk) CTE/subquery name extraction (standalone)
TableExtractor Table names, table aliases, position sorting Nothing else
NestedResolver CTE/subquery names, CTE/subquery bodies, column resolution Column extraction
QueryTypeExtractor Query type detection Nothing else

Development Workflow

Setup

poetry install          # Install dependencies

Testing

make test              # Run all tests with pytest
poetry run pytest -vv  # Verbose test output
poetry run pytest -x   # Stop on first failure
poetry run pytest test/test_with_statements.py::test_name  # Run specific test

Linting

make lint              # Run ruff check with auto-fix
poetry run ruff check --fix sql_metadata

Code Formatting

make format            # Run ruff formatter
poetry run ruff format .

Type Checking

poetry run mypy sql_metadata

Coverage

make coverage          # Run tests with coverage report
poetry run pytest -vv --cov=sql_metadata --cov-report=term-missing

Important: The project has a 100% test coverage requirement (fail_under = 100 in pyproject.toml).

Verification after changes

After making code changes, always run all three checks:

poetry run pytest -vv --cov=sql_metadata --cov-report=term-missing  # tests + coverage
poetry run mypy sql_metadata                                         # type checking
poetry run ruff check sql_metadata                                   # linting

Code Quality Standards

Ruff Configuration (pyproject.toml)

  • Max line length: 88
  • Max complexity: 8 (C901 error for complexity > 8)
  • Enabled rule sets: E, F, W (pycodestyle/pyflakes), C90 (mccabe), I (isort)
  • Exceptions: Use # noqa: C901 for complex but necessary functions

Review Practices

Verify before grading severity

When reviewing code (or producing a critical review of a branch/PR), spike every claim before attaching a severity or a "~N LoC removable" number:

  1. Read the tests that cover the code path you're flagging — they encode the actual contract you'd be changing.
  2. If the claim is "library X already handles this", actually run X against a handful of real inputs from the codebase and confirm the output shape matches what downstream code consumes.
  3. If the claim is "N lines removable", sketch the replacement and see whether tests still pass — mentally or via a throwaway branch.
  4. Only verified claims deserve HIGH severity or concrete LoC numbers. Unverified hunches belong in a "needs investigation" list, not a severity-ranked review.

Estimates without verification give false authority to findings that may not hold up. In a past v3 review four HIGH/MEDIUM items (comment extraction, scope-based resolution, LIMIT regex, "god class" LoC grade) dissolved within minutes of actual investigation; all four would have been caught by a pre-grade spike.

Before closing a review phase, re-read every HIGH and MEDIUM finding and confirm a verification step exists in the session transcript for each one. If a spike did not happen, downgrade or drop the finding before publishing. Codifying the rule in memory is not enough — it has to be applied before the claim is formed, not consulted afterwards.

Error Handling Patterns

Malformed SQL Detection

The codebase has established patterns for handling malformed SQL:

  1. Detect the malformed pattern early
  2. Raise ValueError("This query is wrong") — This is the standard error message
  3. Use pytest.raises in tests:
parser = Parser(malformed_query)
with pytest.raises(ValueError, match="This query is wrong"):
    parser.tables

Testing Patterns

Test Organization

Tests are organized by feature/SQL clause:

  • test_with_statements.py — WITH clause (CTEs)
  • test_getting_tables.py — Table extraction
  • test_getting_columns.py — Column extraction
  • test_query_type.py — Query type detection
  • Database-specific: test_mssql_server.py, test_postgress.py, test_hive.py, etc.

Test Naming Convention

def test_descriptive_name():
    """Optional docstring explaining the test"""
    query = """SQL query here"""
    parser = Parser(query)
    assert parser.tables == ["expected", "tables"]

Testing Malformed SQL

def test_malformed_case():
    # Comment explaining what's being tested and why
    # Include issue reference if applicable: # https://github.com/macbre/sql-metadata/issues/XXX
    query = """Malformed SQL"""
    parser = Parser(query)
    with pytest.raises(ValueError, match="This query is wrong"):
        parser.tables

Test Coverage

  • Every new feature needs tests
  • Every bug fix needs a test that would have caught the bug
  • Coverage must remain at 100%

Test Comments

Reference issues in test comments:

def test_issue_fix():
    # Test for issue #556 - malformed WITH query causes infinite loop
    # https://github.com/macbre/sql-metadata/issues/556

Git Workflow

Commit Message Format

Following the established pattern:

Brief description of change

Resolves #issue-number.

Co-Authored-By: Claude <noreply@anthropic.com>

Branch Naming

  • Feature: feature/description
  • Bug fix: fix/description

Dependencies

Production

  • sqlglot (^30.0.3): SQL parsing and AST construction

Development

  • pytest (^9.0.2): Testing framework
  • pytest-cov (^7.1.0): Coverage reporting
  • ruff (^0.11): Linting and formatting
  • coverage (^7.13): Coverage measurement

Version Information

Known Patterns to Follow

1. Property Caching

Always cache property results:

@property
def my_property(self):
    if self._my_property is not None:
        return self._my_property
    self._my_property = self._compute_property()
    return self._my_property

2. Error Messages

Use consistent error messages:

  • "This query is wrong" — for malformed SQL
  • "Empty queries are not supported!" — for empty input
  • Keep messages simple and consistent with existing patterns

3. Prefer sqlglot over manual parsing

Always use sqlglot AST features (node types, find_all, arg_types traversal) rather than regex or manual string parsing when possible.

Quick Reference Commands

# Setup
poetry install

# Test
make test                                    # All tests
poetry run pytest test/test_with_statements.py -vv  # Specific file
poetry run pytest -x                         # Stop on first failure
poetry run pytest -k "test_name"            # Run by name pattern

# Quality
make lint                                    # Lint check
make format                                  # Format code
make coverage                               # Coverage report

# Debug
poetry run python -c "from sql_metadata import Parser; print(Parser('SELECT * FROM t').tables)"

Debugging Tips

Inspecting the AST

from sql_metadata import Parser
p = Parser("SELECT a FROM t")
print(p._ast_parser.ast)        # sqlglot AST tree
print(repr(p._ast_parser.ast))  # Detailed node repr

Running Single Test with Timeout

timeout 5 poetry run pytest test/test_file.py::test_name -vv

Last Updated

2026-03-31 — Rewritten for v3 architecture (sqlglot-based, class extractors)