Skip to content

Commit b627ab4

Browse files
Add cohort metric type for two-level aggregation (#124)
* Add cohort metric type for two-level aggregation with HAVING Cohort metrics aggregate per entity with a HAVING filter, then re-aggregate the filtered results. Common analytics pattern for questions like "users active 2+ days" or "users on multiple platforms." Example: - name: retained_users type: cohort entity: person_id entity_dimensions: [platform] inner_metrics: - name: active_days agg: count_distinct sql: "cast(timestamp as date)" having: "active_days >= 2" agg: count Generates: SELECT platform, COUNT(*) AS retained_users FROM ( SELECT person_id, platform, COUNT(DISTINCT CAST(timestamp AS DATE)) AS active_days FROM events WHERE ... GROUP BY person_id, platform HAVING active_days >= 2 ) cohort_sub GROUP BY platform * Fix cohort metric resolution and dimension unpacking Three fixes in _generate_cohort_metric_query: 1. Graph-level cohort metrics (added via graph.add_metric) now resolve correctly. Previously only scanned model.get_metric(), missing metrics registered at graph level. Now checks graph.get_metric() first, matching the retention resolver pattern. 2. _parse_dimension_refs returns (dim_ref, granularity) tuples, not dicts. The dimension loop was calling .get() on tuples, causing AttributeError when any dimension was included in a cohort query. 3. Query-level dimensions were computed but never added to the inner SELECT/GROUP BY (dead variable inner_select_cols_extra). Moved the inner_select/inner_group join after the dimension loop so dimensions are properly included in both inner and outer queries. Also formats cli.py (pre-existing). * Auto-update JSON schema * Validate cohort+conversion mixes and use dialect-aware date_trunc Two fixes: 1. The conversion branch returned early without checking if cohort metrics were also present, silently dropping them. Now validates that conversion metrics aren't mixed with other special metric types, matching the retention and cohort patterns. 2. Cohort dimension granularity used hardcoded DATE_TRUNC('gran', col) which is invalid on BigQuery. Now uses self._date_trunc() for dialect-aware truncation. * Fix cohort metric outer SQL scope and raise on unresolved dimensions Outer SQL was using _replace_model_placeholder which maps {model} to the inner table alias (t), but the outer query operates on cohort_sub. Also, unresolved or cross-model dimensions were silently dropped instead of raising an error. * Require sql field for non-count cohort aggregations AVG(*), SUM(*), etc. are invalid SQL. Now raises ValueError at generation time when a cohort metric (outer or inner) uses a non-count aggregation without specifying a sql expression. * Validate inner_metrics entries at cohort metric construction time Each inner_metrics entry must have a 'name' key, and non-count aggregations must include a 'sql' field. Catches malformed definitions like inner_metrics: [{}] early instead of failing with KeyError during SQL generation. * Require sql for COUNT_DISTINCT in cohort inner metrics COUNT(DISTINCT *) is invalid SQL. Only bare COUNT can default to *. Updated both metric validation and generator runtime check. * Allow cohort metrics in validation and preserve fields in export - Add 'cohort' to allowed metric types in validate_metric() and skip agg validation for cohort metrics in validate_model() - Serialize inner_metrics, entity_dimensions, and having in both model-level and graph-level metric export paths - Export agg for graph-level metrics (needed for cohort outer agg) * Resolve unqualified model-scoped cohort metrics in window-path check Both metric_needs_window and resolve_metric_ref (inside _generate_with_window_functions) only checked graph-level metrics for unqualified names. Model-scoped cohort metrics like metrics=["multi_platform_users"] were not found, causing either "No models found" errors or infinite recursion. Added model-scan fallback to both resolution paths. * Detect ambiguous cohort metrics and quote reserved-word aliases - Cohort metric name-scan now collects all matches across models and raises ValueError when >1 match found, instead of silently picking the first by insertion order. - _quote_identifier now delegates to sqlglot for simple identifiers, which correctly quotes SQL reserved words like 'order', 'select', etc. This fixes invalid SQL in cohort (and all other) query paths when dimension names collide with reserved words. * Detect ambiguous unqualified metrics in window-path resolver and fix CI flake - resolve_metric_ref model-scan fallback now collects all matches and raises ValueError on ambiguity, preventing silent wrong-model queries for any metric type (not just cohort). - Bump complex_rewrite_performance threshold from 15ms to 20ms to fix flaky CI on slower Python 3.11 runners (hit 15.625ms). * Quote inner metric aliases and ORDER BY fields in cohort SQL Inner metric names, outer metric name, and ORDER BY field names are now passed through _quote_alias so reserved words and special characters produce valid SQL. --------- Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
1 parent cfc34cd commit b627ab4

7 files changed

Lines changed: 926 additions & 11 deletions

File tree

sidemantic-schema.json

Lines changed: 144 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -426,6 +426,22 @@
426426
"description": "Entity to track (e.g., 'user_id')",
427427
"title": "Entity"
428428
},
429+
"entity_dimensions": {
430+
"anyOf": [
431+
{
432+
"items": {
433+
"type": "string"
434+
},
435+
"type": "array"
436+
},
437+
{
438+
"type": "null"
439+
}
440+
],
441+
"default": null,
442+
"description": "Dimensions to carry through from inner to outer aggregation in cohort metrics",
443+
"title": "Entity Dimensions"
444+
},
429445
"extends": {
430446
"anyOf": [
431447
{
@@ -507,6 +523,36 @@
507523
"description": "Grain for period-to-date (e.g., 'month' for MTD)",
508524
"title": "Grain To Date"
509525
},
526+
"having": {
527+
"anyOf": [
528+
{
529+
"type": "string"
530+
},
531+
{
532+
"type": "null"
533+
}
534+
],
535+
"default": null,
536+
"description": "HAVING filter on inner aggregation for cohort metrics",
537+
"title": "Having"
538+
},
539+
"inner_metrics": {
540+
"anyOf": [
541+
{
542+
"items": {
543+
"additionalProperties": true,
544+
"type": "object"
545+
},
546+
"type": "array"
547+
},
548+
{
549+
"type": "null"
550+
}
551+
],
552+
"default": null,
553+
"description": "Per-entity aggregations for cohort metrics (list of {name, agg, sql})",
554+
"title": "Inner Metrics"
555+
},
510556
"label": {
511557
"anyOf": [
512558
{
@@ -680,7 +726,8 @@
680726
"cumulative",
681727
"time_comparison",
682728
"conversion",
683-
"retention"
729+
"retention",
730+
"cohort"
684731
],
685732
"type": "string"
686733
},
@@ -1433,6 +1480,22 @@
14331480
"description": "Entity to track (e.g., 'user_id')",
14341481
"title": "Entity"
14351482
},
1483+
"entity_dimensions": {
1484+
"anyOf": [
1485+
{
1486+
"items": {
1487+
"type": "string"
1488+
},
1489+
"type": "array"
1490+
},
1491+
{
1492+
"type": "null"
1493+
}
1494+
],
1495+
"default": null,
1496+
"description": "Dimensions to carry through from inner to outer aggregation in cohort metrics",
1497+
"title": "Entity Dimensions"
1498+
},
14361499
"extends": {
14371500
"anyOf": [
14381501
{
@@ -1514,6 +1577,36 @@
15141577
"description": "Grain for period-to-date (e.g., 'month' for MTD)",
15151578
"title": "Grain To Date"
15161579
},
1580+
"having": {
1581+
"anyOf": [
1582+
{
1583+
"type": "string"
1584+
},
1585+
{
1586+
"type": "null"
1587+
}
1588+
],
1589+
"default": null,
1590+
"description": "HAVING filter on inner aggregation for cohort metrics",
1591+
"title": "Having"
1592+
},
1593+
"inner_metrics": {
1594+
"anyOf": [
1595+
{
1596+
"items": {
1597+
"additionalProperties": true,
1598+
"type": "object"
1599+
},
1600+
"type": "array"
1601+
},
1602+
{
1603+
"type": "null"
1604+
}
1605+
],
1606+
"default": null,
1607+
"description": "Per-entity aggregations for cohort metrics (list of {name, agg, sql})",
1608+
"title": "Inner Metrics"
1609+
},
15171610
"label": {
15181611
"anyOf": [
15191612
{
@@ -1687,7 +1780,8 @@
16871780
"cumulative",
16881781
"time_comparison",
16891782
"conversion",
1690-
"retention"
1783+
"retention",
1784+
"cohort"
16911785
],
16921786
"type": "string"
16931787
},
@@ -2203,6 +2297,22 @@
22032297
"description": "Entity to track (e.g., 'user_id')",
22042298
"title": "Entity"
22052299
},
2300+
"entity_dimensions": {
2301+
"anyOf": [
2302+
{
2303+
"items": {
2304+
"type": "string"
2305+
},
2306+
"type": "array"
2307+
},
2308+
{
2309+
"type": "null"
2310+
}
2311+
],
2312+
"default": null,
2313+
"description": "Dimensions to carry through from inner to outer aggregation in cohort metrics",
2314+
"title": "Entity Dimensions"
2315+
},
22062316
"extends": {
22072317
"anyOf": [
22082318
{
@@ -2284,6 +2394,36 @@
22842394
"description": "Grain for period-to-date (e.g., 'month' for MTD)",
22852395
"title": "Grain To Date"
22862396
},
2397+
"having": {
2398+
"anyOf": [
2399+
{
2400+
"type": "string"
2401+
},
2402+
{
2403+
"type": "null"
2404+
}
2405+
],
2406+
"default": null,
2407+
"description": "HAVING filter on inner aggregation for cohort metrics",
2408+
"title": "Having"
2409+
},
2410+
"inner_metrics": {
2411+
"anyOf": [
2412+
{
2413+
"items": {
2414+
"additionalProperties": true,
2415+
"type": "object"
2416+
},
2417+
"type": "array"
2418+
},
2419+
{
2420+
"type": "null"
2421+
}
2422+
],
2423+
"default": null,
2424+
"description": "Per-entity aggregations for cohort metrics (list of {name, agg, sql})",
2425+
"title": "Inner Metrics"
2426+
},
22872427
"label": {
22882428
"anyOf": [
22892429
{
@@ -2457,7 +2597,8 @@
24572597
"cumulative",
24582598
"time_comparison",
24592599
"conversion",
2460-
"retention"
2600+
"retention",
2601+
"cohort"
24612602
],
24622603
"type": "string"
24632604
},

sidemantic/adapters/sidemantic.py

Lines changed: 22 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -312,6 +312,10 @@ def _parse_model(self, model_def: dict) -> Model | None:
312312
window_expression=measure_def.get("window_expression"),
313313
window_frame=measure_def.get("window_frame"),
314314
window_order=measure_def.get("window_order"),
315+
# Cohort parameters
316+
inner_metrics=measure_def.get("inner_metrics"),
317+
entity_dimensions=measure_def.get("entity_dimensions"),
318+
having=measure_def.get("having"),
315319
)
316320
measures.append(measure)
317321

@@ -429,6 +433,9 @@ def _parse_metric(self, metric_def: dict) -> Metric | None:
429433
retention_granularity=(metric_def.get("retention_granularity") or metric_def.get("granularity"))
430434
if metric_type == "retention"
431435
else None,
436+
inner_metrics=metric_def.get("inner_metrics"),
437+
entity_dimensions=metric_def.get("entity_dimensions"),
438+
having=metric_def.get("having"),
432439
window=metric_def.get("window"),
433440
grain_to_date=metric_def.get("grain_to_date"),
434441
window_expression=metric_def.get("window_expression"),
@@ -603,6 +610,13 @@ def _export_model(self, model: Model) -> dict:
603610
measure_def["periods"] = measure.periods
604611
if measure.retention_granularity:
605612
measure_def["retention_granularity"] = measure.retention_granularity
613+
# Cohort parameters
614+
if measure.inner_metrics:
615+
measure_def["inner_metrics"] = measure.inner_metrics
616+
if measure.entity_dimensions:
617+
measure_def["entity_dimensions"] = measure.entity_dimensions
618+
if measure.having:
619+
measure_def["having"] = measure.having
606620
# Cumulative/window parameters
607621
if measure.window:
608622
measure_def["window"] = measure.window
@@ -694,13 +708,21 @@ def _export_metric(self, measure: Metric, graph) -> dict:
694708
result["periods"] = measure.periods
695709
if measure.retention_granularity:
696710
result["retention_granularity"] = measure.retention_granularity
711+
if measure.inner_metrics:
712+
result["inner_metrics"] = measure.inner_metrics
713+
if measure.entity_dimensions:
714+
result["entity_dimensions"] = measure.entity_dimensions
715+
if measure.having:
716+
result["having"] = measure.having
697717
if measure.sql:
698718
result["sql"] = measure.sql
699719
# Auto-detect and export dependencies for derived measures
700720
if measure.type == "derived":
701721
dependencies = measure.get_dependencies(graph)
702722
if dependencies:
703723
result["metrics"] = list(dependencies)
724+
if measure.agg:
725+
result["agg"] = measure.agg
704726
if measure.window:
705727
result["window"] = measure.window
706728
if measure.filters:

sidemantic/core/metric.py

Lines changed: 29 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -215,11 +215,29 @@ def validate_type_specific_fields(self):
215215
raise ValueError("retention metric requires 'entity' field")
216216
if not self.cohort_event:
217217
raise ValueError("retention metric requires 'cohort_event' field")
218+
if self.type == "cohort":
219+
if not self.entity:
220+
raise ValueError("cohort metric requires 'entity' field")
221+
if not self.inner_metrics:
222+
raise ValueError("cohort metric requires 'inner_metrics' field")
223+
for i, im in enumerate(self.inner_metrics):
224+
if not isinstance(im, dict) or not im.get("name"):
225+
raise ValueError(f"cohort metric inner_metrics[{i}] must be a dict with at least a 'name' key")
226+
im_agg = im.get("agg", "count").upper()
227+
if im_agg != "COUNT" and not im.get("sql"):
228+
raise ValueError(
229+
f"cohort metric inner_metrics[{i}] ('{im['name']}') "
230+
f"uses agg '{im_agg}' which requires a 'sql' field"
231+
)
232+
if not self.having:
233+
raise ValueError("cohort metric requires 'having' field")
234+
if not self.agg:
235+
raise ValueError("cohort metric requires 'agg' field for outer aggregation")
218236
return self
219237

220238
# Metric type (if this is a complex metric, not just a simple aggregation)
221-
type: Literal["ratio", "derived", "cumulative", "time_comparison", "conversion", "retention"] | None = Field(
222-
None, description="Metric type for complex calculations"
239+
type: Literal["ratio", "derived", "cumulative", "time_comparison", "conversion", "retention", "cohort"] | None = (
240+
Field(None, description="Metric type for complex calculations")
223241
)
224242

225243
# Ratio parameters
@@ -278,6 +296,15 @@ def validate_type_specific_fields(self):
278296
None, description="Time granularity for retention periods (day, week, month)"
279297
)
280298

299+
# Cohort metric parameters
300+
inner_metrics: list[dict[str, Any]] | None = Field(
301+
None, description="Per-entity aggregations for cohort metrics (list of {name, agg, sql})"
302+
)
303+
entity_dimensions: list[str] | None = Field(
304+
None, description="Dimensions to carry through from inner to outer aggregation in cohort metrics"
305+
)
306+
having: str | None = Field(None, description="HAVING filter on inner aggregation for cohort metrics")
307+
281308
# Common parameters
282309
filters: list[str] | None = Field(None, description="Optional WHERE clause filters")
283310
fill_nulls_with: int | float | str | None = Field(None, description="Default value when result is NULL")

0 commit comments

Comments
 (0)