Findings, Culprits, Actions #290

rferraton · 2026-04-26T21:36:02Z

rferraton
Apr 26, 2026

Findings, Culprits, and Actions: A Suggested Model for PerformanceStudio

Executive Summary

The best direction is to avoid showing every raw rule hit as an alert. Instead, the system should distinguish between:

Findings: raw detections from analysis rules.
Culprits: clustered root causes behind one or more findings.
Actions: concrete recommendations attached to each culprit.
Display selection: a curated set of the top 1 to 3 culprits, with lower-impact findings collapsed.

This gives users a cleaner diagnostic experience and aligns with the goal of ranking issues by expected performance impact rather than by subjective severity labels.

1. Culprits insteads of Findings

The system should select the top 3 root-cause culprits, not the top 3 individual findings.

A single real performance problem can produce many findings:

Bad cardinality estimate
  -> wrong join type
  -> spill
  -> huge memory grant
  -> parallel skew
  -> CX waits
  -> many operator warnings

If the tool selects the top 3 findings directly, it may show three symptoms of the same underlying problem. Instead, add a clustering layer:

Raw findings
    ↓
Cluster by root cause
    ↓
Score each culprit
    ↓
Show top 1–3 culprit groups

Example output:

1. Cardinality underestimation on CustomerOrders filter
   Estimated benefit: up to 68%
   Main symptoms:
   - Hash spill on node 17
   - Bad nested loops choice on node 21
   - Excessive reads from Orders index
   Suggested actions:
   - Index
   - Rewrite
   - Statistics/model review

2. ASYNC_NETWORK_IO dominates elapsed time
   Estimated benefit: up to 42%
   Suggested actions:
   - Application/client fetch pattern
   - Result-set reduction

2. Scoring Model

Each culprit should be scored using more than MaxBenefitPercent alone.

Suggested formula:

CulpritScore =
    MaxBenefitPercent
  × Confidence
  × Actionability
  × EvidenceQuality
  ÷ ComplexityPenalty

MaxBenefitPercent

This is the theoretical upper bound.

Example:

Sort spill consumed 12% of statement elapsed time
=> max benefit <= 12%

Confidence

Confidence measures how likely it is that the finding is real and relevant.

High-confidence signals:

Actual elapsed time is available.
The operator executed.
Wait duration is significant.
Row counts are internally consistent.

Low-confidence signals:

Estimated plan only.
Zero executions.
Finding inferred from text pattern only.
Missing runtime statistics.

This helps avoid false positives such as warnings on nodes with zero executions or technically true findings that are not useful.

Actionability

Actionability measures whether the tool can suggest a concrete fix.

High actionability:

Missing useful index candidate.
Non-SARGable predicate.
Scalar UDF.
Excessive key lookup.
Obvious implicit conversion.

Medium actionability:

Bad cardinality estimate.
Parameter sensitivity.
Spill caused by underestimated rows.

Low actionability:

Generic CXPACKET.
Generic parallel skew.
Broad memory grant warning.

EvidenceQuality

Evidence quality measures the strength of supporting evidence.

Strong evidence:

Operator self-time is high.
Actual reads are high.
Wait time is high.
Issue repeats across executions.

Weak evidence:

Based only on estimated cost.
Based only on optimizer estimate.
No runtime stats available.

ComplexityPenalty

Complexity penalty prevents the tool from recommending risky or expensive work too aggressively.

Low penalty:

Add covering index.
Update statistics.
Remove scalar function.
Fix implicit conversion.

Higher penalty:

Rewrite query shape.
Change data model.
Change MAXDOP or cost threshold.
Alter application fetch behavior.

3. Cutoff Rules

Use three cutoff mechanisms instead of one.

A. Maximum Number of Visible Culprits

Default:

Show maximum 3 culprit groups

But if one culprit dominates, do not force additional noise:

If culprit #1 explains >= 70% of quantified benefit:
    show only culprit #1 prominently
    collapse the rest

B. Minimum Benefit Threshold

For actual plans:

Hide from main alert list if:
    MaxBenefitPercent < 5%
    AND absolute possible saving < 250 ms

For long-running queries, include an absolute time threshold:

Show if:
    MaxBenefitPercent >= 5%
    OR possible saving >= 1 second

This prevents hiding a 2% issue on a 10-minute query.

C. Marginal Value Threshold

If the first two culprits already explain most of the quantified opportunity, do not force a third.

Example:

Culprit 1: 61%
Culprit 2: 18%
Culprit 3: 2%

Show:
1 and 2

Collapse:
3 and below as low-impact findings

4. Display Model

Avoid using the word alert for most items. Prefer:

Main Performance Culprits
Secondary Findings
Diagnostics

Suggested UI structure:

Main Performance Culprits
-------------------------
1. Bad cardinality estimate caused expensive join/spill
   Up to 64% benefit
   Confidence: High
   Action categories: Index, Rewrite, Model

2. Client/network consumption bottleneck
   Up to 41% benefit
   Confidence: High
   Action categories: Application, Query shape

Secondary Findings
------------------
- Key lookup, up to 3%
- Small spill, up to 1%
- Row goal, unknown benefit

Diagnostics
-----------
- Local variable detected
- Optimize for unknown
- Plan uses parallelism

This keeps the main screen focused without making secondary findings disappear completely.

5. Action Categories

Recommended Categories

Category	Meaning	Examples
Index	Access path or physical index change	Add index, include column, filtered index, remove lookup
Rewrite	Query text or query shape change	SARGability, remove scalar UDF, split query, pre-aggregate
Model	Schema or relational model issue	Normalization issue, wrong data type, computed column, partitioning, many-to-many explosion
Statistics	Optimizer input quality	Update stats, filtered stats, multi-column stats, ascending key problem
Config	SQL Server, database, or session settings	`MAXDOP`, cost threshold, memory grant feedback, compatibility level
Runtime / Resource	Infrastructure or runtime pressure	tempdb, memory, CPU pressure, storage latency
Application	Client or ORM behavior	`ASYNC_NETWORK_IO`, fetch size, result-set over-fetching, chatty calls
Concurrency	Blocking: Index, application design or Isolation Level, latching : index or thread starvation	Locks, latch contention, `THREADPOOL`
Investigate	Not enough confidence for a fix	Needs actual plan, parameter values, or workload context

I would also keep Application separate from Rewrite. For example, ASYNC_NETWORK_IO often points to slow client consumption, oversized result sets, or row-by-row application processing rather than a pure SQL rewrite problem.

6. Suggested Action Object

Internally, each recommendation could be represented like this:

{
  "culprit_id": "cardinality-root-node-17",
  "title": "Cardinality underestimation caused spill and bad join choice",
  "max_benefit_percent": 64.2,
  "confidence": "high",
  "evidence_quality": "actual-runtime",
  "primary_category": "Statistics",
  "secondary_categories": ["Index", "Rewrite"],
  "complexity": "medium",
  "risk": "medium",
  "actions": [
    {
      "category": "Statistics",
      "text": "Review statistics on Sales.CustomerId and Sales.OrderDate.",
      "confidence": "medium"
    },
    {
      "category": "Index",
      "text": "Consider an index supporting the filter and join predicates.",
      "confidence": "medium"
    },
    {
      "category": "Rewrite",
      "text": "Check whether the filter can be pushed earlier in the query.",
      "confidence": "low"
    }
  ],
  "symptoms": [
    "Hash Match spill on node 22",
    "Nested Loops high executions on node 31",
    "Memory grant underestimated"
  ],
  "hidden_findings_count": 7
}

The important point is that one culprit may have several possible action categories.

Example:

Bad cardinality estimate

Could be fixed by:

Statistics
Index
Rewrite
Model
Config

Do not force a single category too early.
May be use preferences in parameters : allow/deny action categories, order categories

7. Preferred Category Taxonomy

Initial taxonomy:

Index
Rewrite
Statistics
Model
Config
Runtime
Application
Concurrency
Investigate

Visual grouping:

Database design:
- Index
- Statistics
- Model

Query design:
- Rewrite

Environment:
- Config
- Runtime
- Concurrency

Outside SQL:
- Application

Fallback:
- Investigate

8. Alert Cutoff Logic

Suggested pseudo-code:

findings = run_all_rules(plan)

findings = remove_invalid_findings(findings)
  where operator_executions = 0
  where benefit = 0 and not diagnostically important
  where rule does not apply to operator type

culprits = cluster_findings_by_root_cause(findings)

for each culprit:
    culprit.max_benefit = estimate_max_benefit(culprit)
    culprit.confidence = estimate_confidence(culprit)
    culprit.actionability = estimate_actionability(culprit)
    culprit.score = max_benefit * confidence * actionability

culprits = sort_by_score_desc(culprits)

main_culprits = []
for culprit in culprits:
    if len(main_culprits) == 3:
        break

    if culprit.max_benefit < 5% and culprit.absolute_ms < 1000:
        continue

    if culprit.confidence == "low" and culprit.actionability == "low":
        move_to_investigate()
        continue

    main_culprits.add(culprit)

secondary = remaining quantified findings
diagnostics = non-quantifiable findings

9. Do Not Hide Severe Uncertainty

Some items may not have high calculated benefit but should still be visible as diagnostics:

- Missing actual runtime stats
- Trivial plan
- Parameter-sensitive plan suspicion
- Local variables
- Optimize for unknown
- Forced plan
- Outdated compatibility level
- Memory grant feedback active or disabled

These should not appear in the main performance culprits list unless clearly connected to runtime cost.

10. Example Final Output

Main performance culprits

1. Cardinality underestimation around node 17
   Estimated benefit: up to 64%
   Confidence: High
   Categories: Statistics, Index, Rewrite
   Why it matters:
   The bad estimate appears to drive a spill and an expensive join choice.
   Suggested next actions:
   - Review statistics on the filter and join columns.
   - Consider an index that supports the selective predicate.
   - Check whether the predicate can be pushed earlier.

2. Client/network consumption bottleneck
   Estimated benefit: up to 41%
   Confidence: High
   Categories: Application, Rewrite
   Why it matters:
   ASYNC_NETWORK_IO dominates elapsed time.
   Suggested next actions:
   - Check whether the application fetches rows slowly 
   - Reduce the result set if possible  : checks columns count and types, checks rows
   - Avoid row-by-row client processing : control application fetch size

Secondary findings

- Key lookup: up to 3%
- Minor spill: up to 1%
- Parallel skew: low confidence

Diagnostics

- Row goal detected
- Local variable detected
- Several unquantified legacy rules were suppressed

Recommended Implementation Phases

Phase 1: Keep current findings, but add clustering and top-3 culprit selection.
Phase 2: Classify every finding/action into categories.
Phase 3: Add confidence/actionability scoring.
Phase 4: Suppress or collapse low-impact findings.
Phase 5: Improve cardinality root-cause attribution.

A valuable change would be

Stop presenting raw findings directly to the user.
Present selected culprit groups instead.

That gives a cleaner user experience even before the scoring is perfect.

erikdarlingdata · 2026-04-26T23:01:09Z

erikdarlingdata
Apr 26, 2026
Maintainer

Thanks for writing this up — there's strong alignment with the #215 refactor that's currently in design. Some of what you're describing is already the direction; some of it is genuinely new and worth absorbing.

What's already in flight

I posted a Stage 6 design draft on #215 covering a "per-operator consolidation" refactor that maps cleanly onto your Findings/Culprits/Actions model:

A (operator-level findings) — a single OperatorFinding per operator with a list of fix bullets, one operator-level benefit %, no per-fix benefit. That's exactly your "one item with multiple symptoms" idea applied to a single operator.
B (plan-level properties) — Runtime card metrics, waits, compile, MAXDOP, serial reason. Already a separate visual section in v1.8.0.
C (cardinality findings) — CardinalityFinding with a source node and an attributed victim node downstream, costed by walking up to find the operator that made a bad choice contingent on the estimate. This is the closest fit to your "cluster findings by root cause" — a single bad estimate that drove a spill, a wrong join choice, and an oversized memory grant rolls up into one CardinalityFinding pointing at the source.
D (non-costable diagnostics) — your "Diagnostics" section. Already visually separated and excluded from the main benefit-sorted list. Local variables, OPTIMIZE FOR UNKNOWN, excessive memory grants, etc. live here.

Your "main culprits / secondary findings / diagnostics" three-tier display maps onto A/C / lower-benefit-A / B+D respectively without changing the data model.

The full draft is on #215 (comment link) and covers six implementation stages.

What's new and worth folding in

Three pieces from your write-up that aren't in the #215 design:

1. Action category taxonomy. Index / Rewrite / Statistics / Model / Config / Runtime / Application / Concurrency / Investigate. Cuts orthogonally to A/B/C/D — A/B/C/D is where the cost lives, your taxonomy is what kind of fix is needed. A single OperatorFinding could carry a primary category and a list of secondary ones. Folding this in lets users filter by what they can actually change ("show me Index suggestions only" vs "show me Application-side issues only").

2. Multi-factor scoring. #215 currently only has MaxBenefitPercent. Your CulpritScore = MaxBenefit × Confidence × Actionability × EvidenceQuality ÷ ComplexityPenalty adds dimensions that are individually defensible:

Confidence handles the estimated-plan / zero-execution / text-pattern-only cases that today render at full weight
Actionability demotes generic CXPACKET / parallel-skew / broad-grant warnings without dropping them
ComplexityPenalty would prevent aggressive MAXDOP / cost-threshold / data-model recommendations from ranking equal to "add an INCLUDE column"

The risk is making the score uninterpretable — Joe's been pushing on "the cost is in the operator itself, don't tie benefit to individual fix ideas". I'd want to keep MaxBenefitPercent as the headline number (it has a clear physical meaning — % of statement elapsed) and use the multipliers as a secondary RankScore for the top-N selection only, not for the displayed benefit %.

3. Top-1-to-3 cap with smart collapse. #215 sorts by benefit but doesn't truncate; everything renders. Your dominant-culprit rule ("if culprit #1 explains ≥70% of quantified benefit, collapse the rest") and marginal-value cutoff are good policy on top of the existing data. This is a UI layer that doesn't require model changes — it could be a config knob (MainCulpritsCount, MarginalThresholdPercent).

Where I'd push back

Cross-operator clustering ("bad cardinality on node 17 drove spill on node 22 drove bad join on node 31, all one Culprit"). Stage 6c gets you most of this for cardinality-driven clusters via the source→victim attribution. Going further — automatically clustering arbitrary findings across multiple operators into a single Culprit — is a hard correctness problem. Easy to be confidently wrong. I'd hold off on this beyond what 6c delivers and let a couple of release cycles inform whether the rollups are obvious enough to automate.

"Stop presenting raw findings directly to the user." A power-user mode (Joe's tested as one) needs the raw findings list. The Stage 6 layout keeps it but pushes A and C consolidations to the top; raw rule output stays accessible. I'd resist a fully-curated-only display because it removes the audit trail that makes the tool trustworthy when scoring is wrong.

Concrete next step

Once Joe signs off on the four open questions in #215, the Stage 6 implementation order naturally folds in your contributions:

Action categories slot into 6b (when each rule migrates to OperatorFinding, tag it with primary/secondary categories at the same time)
Multi-factor RankScore slots into 6c (once C costing exists; layer the multipliers on top)
Top-N display + collapse policy slots into 6f as a UI-only change

If you want, I can edit the #215 design doc to incorporate these and credit this discussion.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Findings, Culprits, Actions #290

Uh oh!

{{title}}

Uh oh!

Replies: 1 comment

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Findings, Culprits, Actions #290

Uh oh!

rferraton Apr 26, 2026

Findings, Culprits, and Actions: A Suggested Model for PerformanceStudio

Executive Summary

1. Culprits insteads of Findings

2. Scoring Model

MaxBenefitPercent

Confidence

Actionability

EvidenceQuality

ComplexityPenalty

3. Cutoff Rules

A. Maximum Number of Visible Culprits

B. Minimum Benefit Threshold

C. Marginal Value Threshold

4. Display Model

5. Action Categories

Recommended Categories

6. Suggested Action Object

7. Preferred Category Taxonomy

8. Alert Cutoff Logic

9. Do Not Hide Severe Uncertainty

10. Example Final Output

Recommended Implementation Phases

Replies: 1 comment

Uh oh!

erikdarlingdata Apr 26, 2026 Maintainer

What's already in flight

What's new and worth folding in

Where I'd push back

Concrete next step

rferraton
Apr 26, 2026

erikdarlingdata
Apr 26, 2026
Maintainer