Commit ad34060

Author: affsantos
feat: focused profiling — only profile changed columns
For wide models (100+ columns), data-diff now profiles only the columns that actually changed, using sqlglot AST analysis. This reduces BQ compute by ~90% for typical changes to wide models.

How it works:
- parse_columns.py identifies added_columns + expression_changes
- is_cte_change_additive() classifies CTE modifications:
  - Additive (new LEFT JOINs, new columns) → safe for focused profiling
  - Structural (WHERE/filter/JOIN changes) → falls back to full profiling
- EXCEPT DISTINCT row comparison always covers ALL columns as safety net
- --full flag to override and profile everything

Also syncs upstream improvements:
- data-diff.sh: get_affected_columns() helper, column filter in build_profile_query(), profiling mode tracking per model
- template.html: focused profiling banner, skip-row styling for non-profiled columns (shown as — instead of misleading zeros)
- SKILL.md: document --full flag and focused profiling behavior
1 parent 491d738 commit ad34060
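
The fallback rules in the commit message can be read as a small decision function. The sketch below is one possible reading, not the code in parse_columns.py or data-diff.sh: `ColumnAnalysis`, `choose_profiling_mode`, and every field name are hypothetical, and the `cte_change_additive` field only stands in for whatever `is_cte_change_additive()` actually returns.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ColumnAnalysis:
    """Hypothetical container for the result of the AST analysis."""
    added_columns: List[str] = field(default_factory=list)
    expression_changes: List[str] = field(default_factory=list)
    cte_changed: bool = False
    # Assumed to mirror the verdict of is_cte_change_additive().
    cte_change_additive: bool = True


def choose_profiling_mode(
    analysis: Optional[ColumnAnalysis],
    force_full: bool = False,
    has_prod_table: bool = True,
) -> Tuple[str, Optional[List[str]]]:
    """Return ("focused", columns) or ("full", None)."""
    if force_full:
        return "full", None  # --full overrides everything
    if analysis is None:
        return "full", None  # analysis failed or is unavailable
    if not has_prod_table:
        return "full", None  # new model, nothing to compare against
    if analysis.cte_changed and not analysis.cte_change_additive:
        return "full", None  # structural CTE change
    # Merge both lists, dropping duplicates while keeping order.
    columns = list(dict.fromkeys(
        analysis.added_columns + analysis.expression_changes
    ))
    if not columns:
        return "full", None  # no identifiable output column changes
    return "focused", columns
```

For example, `choose_profiling_mode(ColumnAnalysis(added_columns=["margin"]))` selects focused mode for the single column `margin`, while the same call with `force_full=True` selects full profiling.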

5 files changed

Lines changed: 506 additions & 73 deletions

README.md

Lines changed: 9 additions & 0 deletions
@@ -9,6 +9,7 @@ Visual data diff for dbt + BigQuery. Compare production vs development data afte
 ```bash
 dbt build --select my_model # 1. Build your changes
 ./data-diff.sh my_model # 2. See what changed → opens HTML report
+./data-diff.sh --full my_model # 3. (optional) Profile ALL columns
 ```
 
 Zero configuration — reads your GCP project, dbt project name, and schemas from the manifest automatically.
@@ -53,6 +54,14 @@ Compares your dev tables against production across three layers:
 
 The HTML report includes summary with risk indicators, per-model cards with column profiles (prod vs dev side-by-side), code diffs, and sample rows.
 
+### Focused Profiling
+
+By default, only **changed columns** are profiled — using sqlglot AST analysis to detect which columns were added or had their expressions modified. For wide models (100+ columns), this is dramatically faster.
+
+CTE modifications are classified as **additive** (new LEFT JOINs, new columns — safe for focused profiling) or **structural** (WHERE/filter/JOIN changes — falls back to full profiling automatically). The `EXCEPT DISTINCT` row comparison always covers all columns regardless.
+
+Use `--full` to force profiling of every column when needed.
+
 ## AI Agent Integration
 
 Includes a `SKILL.md` for [pi](https://github.com/mariozechner/pi-coding-agent) and Claude Code. The agent suggests running data-diff after validation passes and summarises findings in chat.
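
To illustrate why restricting the column list cuts BigQuery compute, here is a minimal sketch of a column-filtered profiling query. It is not the `build_profile_query()` from data-diff.sh: the function name `sketch_profile_query` is hypothetical, and the statistics chosen (row count, null count, distinct count) are assumptions, since the commit does not show which metrics the real profile computes.

```python
from typing import List


def sketch_profile_query(table: str, columns: List[str]) -> str:
    """Build one aggregate query that profiles only `columns` of `table`.

    The statistics here are assumed for illustration; the real profile
    may compute different ones.
    """
    if not columns:
        raise ValueError("need at least one column to profile")
    parts = ["COUNT(*) AS row_count"]
    for col in columns:
        # Backticks quote identifiers in BigQuery standard SQL.
        parts.append(f"COUNTIF(`{col}` IS NULL) AS `{col}__nulls`")
        parts.append(f"COUNT(DISTINCT `{col}`) AS `{col}__distinct`")
    select_list = ",\n  ".join(parts)
    return f"SELECT\n  {select_list}\nFROM `{table}`"


# Focused mode passes only the changed columns; full mode passes all of them.
print(sketch_profile_query("my-project.dbt_dev.my_model", ["margin", "discount"]))
```

Because BigQuery storage is columnar, a query that references 2 columns of a 100-column table scans far less data than one that references all of them. Note that `COUNT(DISTINCT ...)` does not accept every BigQuery type (ARRAY and STRUCT columns, for instance), so a real implementation would need per-type handling.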

SKILL.md

Lines changed: 26 additions & 3 deletions
@@ -48,11 +48,33 @@ Suggest running it when:
 
 # Multiple models
 .agents/skills/data-diff/data-diff.sh "int_order_pricing int_product_inventory"
+
+# Full profiling (all columns — slower but complete)
+.agents/skills/data-diff/data-diff.sh --full int_order_pricing
 ```
 
 The script prints progress to stderr and the output HTML file path to
 stdout. It automatically opens the page in the browser on macOS.
 
+### Focused profiling (default)
+
+By default, data-diff only profiles **columns that changed** (added or
+expression-modified) using sqlglot analysis. This is dramatically faster
+for wide models (e.g. a 100+ column model profiling only the ~11 that
+changed).
+
+The EXCEPT DISTINCT row comparison still covers **all columns**, so data
+changes in non-profiled columns are still caught in the sample rows
+section.
+
+Fallback to full profiling happens automatically when:
+- sqlglot analysis fails or is unavailable
+- Only CTEs changed with no identifiable output column changes
+- The model is new (no production counterpart)
+
+Use `--full` to force profiling of all columns when you need complete
+statistical comparison (e.g. investigating indirect CTE changes).
+
 ### What happens under the hood
 
 | Step | What | Cost |
@@ -61,11 +83,12 @@ stdout. It automatically opens the page in the browser on macOS.
 | 2. Code diff | sqlglot AST parse (local) | Free |
 | 3. Schema diff | `INFORMATION_SCHEMA` query | 1 fast query / model |
 | 4. Extract primary keys | Manifest parsing (local) | Free |
-| 5. Profile columns | BQ profiling query | 1 query / model / env |
-| 6. Sample rows | `EXCEPT DISTINCT` query | 1 query / model |
+| 5. Profile columns | BQ profiling query (focused: only changed cols) | 1 query / model / env |
+| 6. Sample rows | `EXCEPT DISTINCT` query (all columns) | 1 query / model |
 | 7–8. Assemble + render | JSON → HTML injection | Free |
 
-**Performance**: ~30–60 seconds for up to 5 models.
+**Performance**: ~30–60 seconds for up to 5 models. Focused profiling
+significantly speeds up wide models (100+ columns → only changed columns).
 
 ## Interpreting the Output
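
The row comparison described in the SKILL.md hunk above acts as a safety net because it is built from the full column list rather than from the profiled subset. The sketch below shows one way such a query could be assembled. The function name `sketch_except_query`, the `shared_columns` parameter, and the default limit are all hypothetical; in particular, comparing only columns present in both tables is an assumption made here so that added columns do not break the set operation, and the real script may handle schema differences differently.

```python
from typing import List


def sketch_except_query(
    left_table: str,
    right_table: str,
    shared_columns: List[str],
    limit: int = 20,
) -> str:
    """Return a query for rows present in `left_table` but not in `right_table`.

    `shared_columns` is the full list of columns the two tables have in
    common, independent of which columns were profiled.
    """
    if not shared_columns:
        raise ValueError("need at least one shared column")
    cols = ", ".join(f"`{c}`" for c in shared_columns)
    # The outer SELECT makes it explicit that LIMIT applies to the whole
    # set operation, not just to its second operand.
    return (
        "SELECT * FROM (\n"
        f"  SELECT {cols} FROM `{left_table}`\n"
        "  EXCEPT DISTINCT\n"
        f"  SELECT {cols} FROM `{right_table}`\n"
        f") LIMIT {int(limit)}"
    )


# Rows that exist in dev but have no identical counterpart in prod.
print(sketch_except_query(
    "my-project.dbt_dev.my_model",
    "my-project.analytics.my_model",
    ["order_id", "price", "margin"],
))
```

Swapping the two table arguments gives the opposite direction (rows only in prod). A change to any compared column makes a row differ, which is why data changes in non-profiled columns still surface in the sample rows.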
