Weight sp_HealthParser #tc wait average by waits count

erikdarlingdata · claude · erikdarlingdata · commit e4b081f78fd3 · 2026-04-19T18:21:29.000-04:00
The #tc aggregation rolls up #topwaits_count rows into wait_type /
rounded-time buckets. It summed the waits count correctly but took
AVG(tc.average_wait_time_ms) — an unweighted mean of already-averaged
per-event values. An event that contributed a single wait got the same
pull on the bucket's output as an event with thousands of waits, so
the displayed "average wait" skewed toward sparse outlier events.

Changed to a weighted average:

    SUM(avg * waits) / NULLIF(SUM(waits), 0)

with CONVERT(decimal(38,2)) on the operands to avoid bigint
multiplication overflow on high-volume waits, and NULLIF on the
divisor so the expression returns NULL instead of raising a
divide-by-zero error if every contributing row has waits = 0. The
result is wrapped in CONVERT(bigint, ...) to preserve the existing
output type.
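
A minimal sketch of the difference, using made-up rows rather than
anything from system_health (the derived table v and its values are
hypothetical). For wait type A, the unweighted mean is
(5000 + 10) / 2 = 2505 ms, while the weighted mean is
(5000 * 1 + 10 * 1000) / 1001, roughly 14.99 ms before the bigint
conversion. Wait type B has only waits = 0, so NULLIF turns its
divisor into NULL and the weighted column comes back NULL:

    /* Hypothetical sample rows; not data from system_health. */
    SELECT
        v.wait_type,
        unweighted_avg_ms =
            AVG(v.average_wait_time_ms),
        weighted_avg_ms =
            CONVERT
            (
                bigint,
                SUM(CONVERT(decimal(38, 2), v.average_wait_time_ms) * CONVERT(decimal(38, 2), v.waits))
              / NULLIF(SUM(CONVERT(decimal(38, 2), v.waits)), 0)
            )
    FROM
    (
        VALUES
            ('A', 1, 5000),  /* one slow wait */
            ('A', 1000, 10), /* many fast waits */
            ('B', 0, 0)      /* no waits at all */
    ) AS v (wait_type, waits, average_wait_time_ms)
    GROUP BY
        v.wait_type;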

Left #td alone — its GROUP BY includes the metric columns themselves,
so that block is effectively a DISTINCT rather than an aggregation,
and is paired with a downstream ROW_NUMBER() dedupe step. Different
shape, different concern.
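
To illustrate why a GROUP BY over every selected column behaves like
DISTINCT, here is a sketch with hypothetical rows and column names
(not the real #td definition). With no aggregate in the select list,
both queries collapse the duplicate ('A', 10) row and return the same
two rows:

    /* Hypothetical rows; column names are illustrative only. */
    SELECT
        v.wait_type,
        v.waits
    FROM (VALUES ('A', 10), ('A', 10), ('B', 5)) AS v (wait_type, waits)
    GROUP BY
        v.wait_type,
        v.waits;

    SELECT DISTINCT
        v.wait_type,
        v.waits
    FROM (VALUES ('A', 10), ('A', 10), ('B', 5)) AS v (wait_type, waits);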

Verified that the sproc installs cleanly and that a run with
@what_to_check = 'waits' against system_health completes without
errors on SQL Server 2022.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/sp_HealthParser/sp_HealthParser.sql b/sp_HealthParser/sp_HealthParser.sql
@@ -2199,7 +2199,21 @@ AND   ca.utc_timestamp < @end_date';
                 ),
             tc.wait_type,
             waits = SUM(CONVERT(bigint, tc.waits)),
-            average_wait_time_ms = CONVERT(bigint, AVG(tc.average_wait_time_ms)),
+            /*
+            Weighted average rather than AVG(avg): tc.average_wait_time_ms
+            is already a per-event average, so AVG() over the bucket was
+            an unweighted mean of means — events with one wait got the
+            same pull on the output as events with thousands. Weight by
+            waits to get the true bucket-scoped average. NULLIF keeps us
+            safe if every contributing row had waits = 0.
+            */
+            average_wait_time_ms =
+                CONVERT
+                (
+                    bigint,
+                    SUM(CONVERT(decimal(38, 2), tc.average_wait_time_ms) * CONVERT(decimal(38, 2), tc.waits))
+                  / NULLIF(SUM(CONVERT(decimal(38, 2), tc.waits)), 0)
+                ),
             max_wait_time_ms = CONVERT(bigint, MAX(tc.max_wait_time_ms))
         INTO #tc
         FROM #topwaits_count AS tc