Commit b6473f9

Add model generation from raw SQL queries
Extended coverage analyzer to bootstrap semantic layers from raw SQL:

Features:
- Generate model definitions from query analysis
- Create rewritten queries using semantic layer
- CLI flag --generate-models to output both models and queries

Usage:
    # Bootstrap from raw queries
    sidemantic coverage --queries raw_queries/ --generate-models output/

This generates:
- output/models/*.yml - Model definitions (dimensions + metrics)
- output/rewritten_queries/*.py - Python code using semantic layer

Example included in examples/coverage_analysis/ with 8 sample queries covering:
- Single table aggregations
- Multi-dimensional grouping
- Cross-model joins
- Missing tables (for gap analysis)

Perfect for migrating from raw SQL to semantic layer.
1 parent cf583ec commit b6473f9

11 files changed

Lines changed: 474 additions & 32 deletions
Lines changed: 100 additions & 0 deletions
@@ -0,0 +1,100 @@
+# Coverage Analysis Example
+
+This example demonstrates how to bootstrap a semantic layer from raw SQL queries.
+
+## Directory Structure
+
+```
+coverage_analysis/
+├── raw_queries/              # Raw SQL queries from your application
+│   ├── revenue_by_status.sql
+│   ├── customer_demographics.sql
+│   ├── product_performance.sql
+│   ├── monthly_trends.sql
+│   ├── high_value_orders.sql
+│   ├── customer_orders.sql
+│   ├── inventory_analysis.sql
+│   └── cancelled_orders.sql
+└── README.md
+```
+
+## Usage
+
+### Bootstrap Semantic Layer from Queries
+
+Generate model definitions and rewritten queries from your raw SQL:
+
+```bash
+cd examples/coverage_analysis
+
+# Generate models and rewritten queries
+uv run sidemantic coverage --queries raw_queries/ --generate-models output/
+```
+
+This will create:
+- `output/models/` - YAML model definitions for each table
+- `output/rewritten_queries/` - Python code showing how to query using the semantic layer
+
+### Analyze Coverage
+
+If you already have a semantic layer, analyze which queries can be rewritten:
+
+```bash
+# Compare queries against existing semantic layer
+uv run sidemantic coverage models/ --queries raw_queries/
+
+# Show detailed analysis for each query
+uv run sidemantic coverage models/ --queries raw_queries/ --verbose
+```
+
+## What Gets Generated
+
+### Model Definitions
+
+From queries like:
+```sql
+SELECT status, SUM(total_amount), COUNT(*)
+FROM orders
+GROUP BY status
+```
+
+Generates models like:
+```yaml
+model:
+  name: orders
+  table: orders
+  description: Auto-generated from query analysis
+dimensions:
+- name: status
+  sql: status
+  type: categorical
+metrics:
+- name: count
+  agg: count
+  sql: '*'
+- name: sum_total_amount
+  agg: sum
+  sql: total_amount
+```
+
+### Rewritten Queries
+
+Generates Python code to replace raw SQL:
+```python
+# Original query:
+# SELECT status, SUM(total_amount), COUNT(*)
+# FROM orders
+# GROUP BY status
+
+result = layer.query(
+    dimensions=['orders.status'],
+    metrics=['orders.count', 'orders.sum_total_amount']
+)
+```
+
+## Use Cases
+
+1. **Migration** - Bootstrap semantic layer from existing SQL queries
+2. **Discovery** - Find what metrics/dimensions your team actually uses
+3. **Standardization** - Identify inconsistent business logic across queries
+4. **Coverage** - Track how much of your SQL can be replaced with semantic layer
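
Review note: the README maps a `GROUP BY` query to one model with dimensions and metrics. The mapping can be illustrated with a toy sketch. This is not how `CoverageAnalyzer` works internally (the analyzer source is not part of this excerpt). `infer_model` is a hypothetical helper that uses regular expressions and assumes a single-table `SELECT` with no commas inside function calls. Metric names follow the `<agg>_<column>` pattern seen in the README output.

```python
import re

# Matches simple aggregate calls such as SUM(total_amount) or COUNT(*).
AGG_RE = re.compile(r"^(sum|count|avg|min|max)\((.+)\)$", re.IGNORECASE)


def infer_model(sql: str) -> dict:
    """Toy inference for single-table SELECT ... FROM ... queries (hypothetical helper)."""
    flat = " ".join(sql.split()).rstrip(";")
    match = re.match(r"select (.+?) from (\w+)", flat, re.IGNORECASE)
    if match is None:
        raise ValueError("unsupported query shape")
    select_list, table = match.groups()

    dimensions, metrics = [], []
    # Assumption: no commas inside function calls, so a plain split is enough.
    for item in select_list.split(","):
        # Drop a trailing "AS alias" if present.
        expr = re.sub(r"\s+as\s+\w+$", "", item.strip(), flags=re.IGNORECASE)
        agg = AGG_RE.match(expr)
        if agg:
            func, arg = agg.group(1).lower(), agg.group(2).strip()
            name = func if arg == "*" else f"{func}_{arg}"
            metrics.append({"name": name, "agg": func, "sql": arg})
        else:
            dimensions.append({"name": expr, "sql": expr, "type": "categorical"})

    return {
        "model": {"name": table, "table": table},
        "dimensions": dimensions,
        "metrics": metrics,
    }


model = infer_model("SELECT status, SUM(total_amount), COUNT(*)\nFROM orders\nGROUP BY status")
print(model["metrics"])
```

Metrics come out in `SELECT` order here. The README example lists `count` first, so the real generator may order them differently.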
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+-- Cancelled orders analysis
+SELECT
+    cancellation_reason,
+    COUNT(*) as cancelled_count,
+    SUM(total_amount) as lost_revenue,
+    AVG(total_amount) as avg_order_value
+FROM orders
+WHERE status = 'cancelled'
+GROUP BY cancellation_reason
+ORDER BY cancelled_count DESC;
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+-- Customer demographics analysis
+SELECT
+    region,
+    age_group,
+    COUNT(*) as customer_count,
+    AVG(total_spent) as avg_lifetime_value
+FROM customers
+GROUP BY region, age_group
+ORDER BY customer_count DESC;
Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
+-- Customer order patterns (cross-model query)
+SELECT
+    c.region,
+    c.customer_segment,
+    COUNT(o.order_id) as order_count,
+    SUM(o.total_amount) as total_spent,
+    AVG(o.total_amount) as avg_order_value
+FROM customers c
+JOIN orders o ON c.customer_id = o.customer_id
+WHERE o.status = 'completed'
+GROUP BY c.region, c.customer_segment
+ORDER BY total_spent DESC;
Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+-- High value orders analysis
+SELECT
+    status,
+    payment_method,
+    COUNT(*) as order_count,
+    AVG(total_amount) as avg_order_value,
+    MAX(total_amount) as max_order_value
+FROM orders
+WHERE total_amount > 500
+GROUP BY status, payment_method
+ORDER BY avg_order_value DESC;
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+-- Inventory and sales analysis (table not in semantic layer)
+SELECT
+    warehouse_location,
+    product_category,
+    SUM(quantity_in_stock) as total_inventory,
+    SUM(quantity_sold) as total_sold,
+    AVG(reorder_point) as avg_reorder_point
+FROM inventory
+GROUP BY warehouse_location, product_category
+ORDER BY total_inventory DESC;
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+-- Monthly revenue trends
+SELECT
+    DATE_TRUNC('month', order_date) as month,
+    COUNT(*) as order_count,
+    SUM(total_amount) as revenue,
+    COUNT(DISTINCT customer_id) as unique_customers
+FROM orders
+WHERE order_date >= '2024-01-01'
+GROUP BY DATE_TRUNC('month', order_date)
+ORDER BY month;
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+-- Product sales performance
+SELECT
+    category,
+    brand,
+    COUNT(DISTINCT product_id) as product_count,
+    SUM(units_sold) as total_units,
+    SUM(revenue) as total_revenue,
+    AVG(price) as avg_price
+FROM products
+GROUP BY category, brand
+HAVING SUM(revenue) > 10000
+ORDER BY total_revenue DESC
+LIMIT 20;
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+-- Total revenue by order status
+SELECT
+    status,
+    SUM(total_amount) as total_revenue,
+    COUNT(*) as order_count
+FROM orders
+GROUP BY status
+ORDER BY total_revenue DESC;
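
Review note: the sample queries include one against `inventory`, marked as a table not in the semantic layer, so it can exercise gap analysis. A minimal sketch of that idea follows. It assumes table names can be read from `FROM` and `JOIN` clauses with a regular expression. `referenced_tables` and `missing_tables` are hypothetical helpers, and the set of known models is an assumption for illustration. A real analyzer would need a proper SQL parser, since a pattern like this also matches `EXTRACT(year FROM col)`.

```python
import re

# Captures the identifier that follows FROM or JOIN.
TABLE_RE = re.compile(r"\b(?:from|join)\s+(\w+)", re.IGNORECASE)


def referenced_tables(sql: str) -> set[str]:
    """Return lower-cased table names that appear after FROM or JOIN."""
    return {name.lower() for name in TABLE_RE.findall(sql)}


def missing_tables(queries: dict[str, str], known_models: set[str]) -> dict[str, set[str]]:
    """Map each query label to the tables it uses that have no model."""
    gaps = {}
    for label, sql in queries.items():
        missing = referenced_tables(sql) - known_models
        if missing:
            gaps[label] = missing
    return gaps


queries = {
    "customer_orders": "SELECT c.region FROM customers c JOIN orders o ON c.customer_id = o.customer_id",
    "inventory_analysis": "SELECT warehouse_location FROM inventory GROUP BY warehouse_location",
}
# Assumed model names, for illustration only.
print(missing_tables(queries, known_models={"orders", "customers", "products"}))
```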

sidemantic/cli.py

Lines changed: 93 additions & 32 deletions
@@ -274,69 +274,130 @@ def info(

 @app.command()
 def coverage(
-    directory: Path = typer.Argument(..., help="Directory containing semantic layer files"),
+    directory: Path = typer.Argument(
+        None, help="Directory containing semantic layer files (optional if using --generate-models)"
+    ),
     queries: Path = typer.Option(
         None, "--queries", "-q", help="Path to file or folder containing SQL queries to analyze"
     ),
     verbose: bool = typer.Option(False, "--verbose", "-v", help="Show detailed analysis for each query"),
+    generate_models: Path = typer.Option(
+        None,
+        "--generate-models",
+        "-g",
+        help="Generate model definitions from queries and write to this directory",
+    ),
 ):
     """
     Analyze SQL queries for semantic layer coverage.

     Determines which queries can be rewritten using your semantic layer and
     identifies missing models, dimensions, and metrics.

+    Can also bootstrap a semantic layer from raw SQL queries.
+
     Examples:
+        # Analyze coverage
         sidemantic coverage models/ --queries queries/
-        sidemantic coverage models/ --queries query.sql
-        sidemantic coverage models/ --queries queries/ --verbose
+        sidemantic coverage models/ --queries query.sql --verbose
+
+        # Bootstrap semantic layer from raw queries
+        sidemantic coverage --queries raw_queries/ --generate-models output/
     """
     from sidemantic.core.coverage_analyzer import CoverageAnalyzer

-    if not directory.exists():
-        typer.echo(f"Error: Directory {directory} does not exist", err=True)
-        raise typer.Exit(1)
-
     if not queries:
         typer.echo("Error: --queries is required", err=True)
-        typer.echo("Usage: sidemantic coverage <models_dir> --queries <path>", err=True)
+        typer.echo("Usage: sidemantic coverage [models_dir] --queries <path>", err=True)
         raise typer.Exit(1)

     if not queries.exists():
         typer.echo(f"Error: {queries} does not exist", err=True)
         raise typer.Exit(1)

-    try:
-        # Load semantic layer
-        layer = SemanticLayer()
-        load_from_directory(layer, str(directory))
+    # Bootstrap mode - generate models from queries
+    if generate_models:
+        try:
+            # Create empty semantic layer for analysis
+            layer = SemanticLayer(auto_register=False)
+            analyzer = CoverageAnalyzer(layer)
+
+            # Analyze queries
+            if queries.is_file():
+                query_list = queries.read_text().split(";")
+                query_list = [q.strip() for q in query_list if q.strip()]
+                report = analyzer.analyze_queries(query_list)
+            else:
+                report = analyzer.analyze_folder(str(queries))

-        if not layer.graph.models:
-            typer.echo("Error: No models found in semantic layer", err=True)
+            # Generate model definitions
+            typer.echo("\nGenerating model definitions...", err=True)
+            models = analyzer.generate_models(report)
+
+            models_dir = generate_models / "models"
+            analyzer.write_model_files(models, str(models_dir))
+
+            # Generate rewritten queries
+            typer.echo("\nGenerating rewritten queries...", err=True)
+            rewritten = analyzer.generate_rewritten_queries(report)
+
+            queries_dir = generate_models / "rewritten_queries"
+            analyzer.write_rewritten_queries(rewritten, str(queries_dir))
+
+            typer.echo(
+                f"\n✓ Generated {len(models)} models and {len(rewritten)} rewritten queries in {generate_models}",
+                err=True,
+            )
+
+        except Exception as e:
+            typer.echo(f"Error: {e}", err=True)
+            import traceback
+
+            traceback.print_exc()
             raise typer.Exit(1)

-        # Create analyzer
-        analyzer = CoverageAnalyzer(layer)
+    # Coverage analysis mode - compare queries against existing models
+    else:
+        if not directory:
+            typer.echo("Error: directory is required when not using --generate-models", err=True)
+            typer.echo("Usage: sidemantic coverage <models_dir> --queries <path>", err=True)
+            raise typer.Exit(1)

-        # Analyze queries
-        if queries.is_file():
-            # Single file - load queries from it
-            query_list = queries.read_text().split(";")
-            query_list = [q.strip() for q in query_list if q.strip()]
-            report = analyzer.analyze_queries(query_list)
-        else:
-            # Directory - load all .sql files
-            report = analyzer.analyze_folder(str(queries))
+        if not directory.exists():
+            typer.echo(f"Error: Directory {directory} does not exist", err=True)
+            raise typer.Exit(1)

-        # Print report
-        analyzer.print_report(report, verbose=verbose)
+        try:
+            # Load semantic layer
+            layer = SemanticLayer()
+            load_from_directory(layer, str(directory))

-    except Exception as e:
-        typer.echo(f"Error: {e}", err=True)
-        import traceback
+            if not layer.graph.models:
+                typer.echo("Error: No models found in semantic layer", err=True)
+                raise typer.Exit(1)

-        traceback.print_exc()
-        raise typer.Exit(1)
+            # Create analyzer
+            analyzer = CoverageAnalyzer(layer)
+
+            # Analyze queries
+            if queries.is_file():
+                # Single file - load queries from it
+                query_list = queries.read_text().split(";")
+                query_list = [q.strip() for q in query_list if q.strip()]
+                report = analyzer.analyze_queries(query_list)
+            else:
+                # Directory - load all .sql files
+                report = analyzer.analyze_folder(str(queries))
+
+            # Print report
+            analyzer.print_report(report, verbose=verbose)
+
+        except Exception as e:
+            typer.echo(f"Error: {e}", err=True)
+            import traceback
+
+            traceback.print_exc()
+            raise typer.Exit(1)


 @app.command()
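
Review note: both modes load a single file with `queries.read_text().split(";")`, which splits wherever a semicolon appears, including inside string literals. The sketch below reproduces that split as `split_statements` and contrasts it with a hypothetical quote-aware variant. Neither function exists in the commit under these names, and the variant tracks single-quoted literals only. It ignores semicolons or quotes inside `--` comments, so treat it as a starting point.

```python
def split_statements(text: str) -> list[str]:
    """Same approach as the diff: split on ';' and drop empty fragments."""
    return [q.strip() for q in text.split(";") if q.strip()]


def split_statements_quote_aware(text: str) -> list[str]:
    """Hypothetical variant: ignore ';' inside single-quoted string literals."""
    statements, current, in_string = [], [], False
    for ch in text:
        if ch == "'":
            # A doubled quote ('') toggles twice, so the state stays correct.
            in_string = not in_string
        if ch == ";" and not in_string:
            statements.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    statements.append("".join(current).strip())
    return [s for s in statements if s]


sql = "SELECT 'a;b' AS x; SELECT 2;"
print(split_statements(sql))              # the literal is cut in two
print(split_statements_quote_aware(sql))  # the literal stays intact
```

The sample queries in this commit contain no semicolons inside literals, so the plain split handles each of their statements correctly.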
