Commit c8e36a1

Add docs/how-collection-works.md
Architecture overview of the collection pipeline for both Full and Lite editions: the minute loop, dispatcher, collector shape, schedule table, retention, Dashboard read path, and where-to-look-next pointers for new contributors. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent add515e commit c8e36a1

1 file changed

Lines changed: 207 additions & 0 deletions

docs/how-collection-works.md

@@ -0,0 +1,207 @@
# How Collection Works

A tour of the collection pipeline for people who know SQL but don't know this codebase. Read this, then read three SQL files, and you'll understand 80% of what Performance Monitor is doing on your server.

This doc covers both editions. Full Edition first (SQL Agent → `PerformanceMonitor` database → Dashboard reads), Lite Edition second (WPF app → DuckDB file → same app reads). The shapes are similar; the surface area is different.

---

## Full Edition

### The minute loop

Three SQL Agent jobs keep the system running, and all collection happens inside the first one:

| Job | What it runs |
| --- | --- |
| `PerformanceMonitor - Collection` | `EXEC collect.scheduled_master_collector @debug = 0;` on a 1-minute schedule (`Every 1 Minute`) |
| `PerformanceMonitor - Data Retention` | `EXEC config.data_retention @debug = 1;` once a day |
| `PerformanceMonitor - Hung Job Monitor` | Kills the Collection job if it's been stuck past its max duration |

When the Collection job fires, it calls the **scheduled master collector**, which acts as the dispatcher. The dispatcher is the heartbeat of the whole system. Every minute it wakes up, figures out which collectors are due, and runs them one at a time.

### The dispatcher

**File**: [`install/42_scheduled_master_collector.sql`](../install/42_scheduled_master_collector.sql)

At the core of the dispatcher is a cursor over `config.collection_schedule` that picks up anything due:

```sql
SELECT
    cs.schedule_id,
    cs.collector_name,
    cs.frequency_minutes,
    cs.max_duration_minutes
FROM config.collection_schedule AS cs
WHERE cs.enabled = 1
AND (
    @force_run_all = 1
    OR cs.next_run_time <= SYSDATETIME()
    OR cs.next_run_time IS NULL
)
ORDER BY
    cs.next_run_time;
```

For each row, the dispatcher has a big `IF/ELSE IF` block that maps `collector_name` to a specific stored procedure:

```sql
ELSE IF @collector_name = N'default_trace_collector'
BEGIN
    EXECUTE collect.default_trace_collector @debug = @debug;
END;
ELSE IF @collector_name = N'blocking_deadlock_analyzer'
BEGIN
    EXECUTE collect.blocking_deadlock_analyzer @debug = @debug;
END;
-- ...etc
```

Each collector runs inside its own `BEGIN TRY / BEGIN CATCH` block, so a failure in one doesn't stop the rest of the cycle. After each run (success or failure), the dispatcher bumps `last_run_time` and sets `next_run_time = last_run_time + frequency_minutes` so the next tick knows when that collector is eligible again.
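To make the due check and the reschedule step concrete, here is a minimal, self-contained sketch that runs against a table variable rather than the real `config.collection_schedule`. The column names come from the schedule table described in this doc; the sample rows, the `@now` variable, and the two made-up collector names are assumptions for illustration. The real dispatcher walks the rows with a cursor and calls each collector inside its own `TRY`/`CATCH`, which this sketch replaces with a single bulk `UPDATE`.

```sql
/* Hypothetical stand-in for config.collection_schedule (sample data only) */
DECLARE @schedule table
(
    collector_name sysname NOT NULL,
    enabled bit NOT NULL,
    frequency_minutes integer NOT NULL,
    last_run_time datetime2(7) NULL,
    next_run_time datetime2(7) NULL
);

DECLARE @now datetime2(7) = SYSDATETIME();

INSERT @schedule
    (collector_name, enabled, frequency_minutes, last_run_time, next_run_time)
VALUES
    (N'default_trace_collector', 1, 5, DATEADD(MINUTE, -5, @now), @now),                         /* due now */
    (N'blocking_deadlock_analyzer', 1, 5, DATEADD(MINUTE, -2, @now), DATEADD(MINUTE, 3, @now)),  /* not due yet */
    (N'never_run_collector', 1, 15, NULL, NULL),                                                 /* made-up name; NULL counts as due */
    (N'disabled_collector', 0, 1, NULL, NULL);                                                   /* made-up name; skipped */

/* Same predicate as the dispatcher cursor, minus @force_run_all */
SELECT
    s.collector_name,
    s.next_run_time
FROM @schedule AS s
WHERE s.enabled = 1
AND (
    s.next_run_time <= @now
    OR s.next_run_time IS NULL
);

/* Reschedule everything that was due: next_run_time = last_run_time + frequency_minutes */
UPDATE s
SET
    last_run_time = @now,
    next_run_time = DATEADD(MINUTE, s.frequency_minutes, @now)
FROM @schedule AS s
WHERE s.enabled = 1
AND (
    s.next_run_time <= @now
    OR s.next_run_time IS NULL
);

/* The two rows that ran now carry a next_run_time in the future */
SELECT
    s.collector_name,
    s.last_run_time,
    s.next_run_time
FROM @schedule AS s;
```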

Before any of this, the dispatcher also does two self-heal steps:

- **Ensures config tables exist** (`config.ensure_config_tables`), which lets you recover from an accidentally dropped table without reinstalling.
- **Detects server restarts**: if `sqlserver_start_time` has changed since the last run, it captures a fresh snapshot of server properties. Config values only change across restarts, so this is the efficient moment to grab them.
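The restart check can be sketched as follows. `sys.dm_os_sys_info.sqlserver_start_time` and `SERVERPROPERTY()` are real SQL Server features, but `@last_seen_start_time` is a hypothetical stand-in for wherever the dispatcher persists the previous value, and the three properties selected are an assumption about what a snapshot might contain, not a list taken from the actual procedure.

```sql
/* Hypothetical: the start time recorded on the previous dispatcher run */
DECLARE @last_seen_start_time datetime = '20240101';
DECLARE @current_start_time datetime;

/* Requires VIEW SERVER STATE permission */
SELECT
    @current_start_time = osi.sqlserver_start_time
FROM sys.dm_os_sys_info AS osi;

/* A changed start time means the instance restarted since the last run */
IF @current_start_time <> @last_seen_start_time
BEGIN
    SELECT
        restarted_at = @current_start_time,
        server_name = SERVERPROPERTY('ServerName'),
        product_version = SERVERPROPERTY('ProductVersion'),
        edition = SERVERPROPERTY('Edition');
END;
```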

### What a collector looks like

Pick any `install/NN_collect_*.sql` file; they all follow the same shape. A minimal example:

**File**: [`install/29_collect_default_trace.sql`](../install/29_collect_default_trace.sql)

```sql
ALTER PROCEDURE
    collect.default_trace_collector
(
    @hours_back integer = 2,
    @include_memory_events bit = 1,
    @include_autogrow_events bit = 1,
    @include_object_events bit = 1,
    -- ...more flags
    @debug bit = 0
)
AS
BEGIN
    BEGIN TRY
        -- 1. Validate parameters
        IF @hours_back <= 0 OR @hours_back > 168
        BEGIN
            RAISERROR(N'@hours_back must be between 1 and 168 hours', 16, 1);
            RETURN;
        END;

        -- 2. Detect first run (empty target table, no prior success in config.collection_log)
        IF NOT EXISTS (SELECT 1/0 FROM collect.default_trace_events)
        AND NOT EXISTS (SELECT 1/0 FROM config.collection_log WHERE collector_name = N'default_trace_collector' AND collection_status = N'SUCCESS')
        BEGIN
            SET @cutoff_time = CONVERT(datetime2(7), '19000101'); -- grab everything on first run
        END;

        -- 3. Query the DMV / system view
        INSERT INTO collect.default_trace_events (...)
        SELECT ...
        FROM sys.fn_trace_gettable(@trace_path, @max_files) AS ft
        WHERE ft.StartTime >= @cutoff_time
        AND <per-collector filters>
        AND NOT EXISTS (<dedupe lookup on event_time + event_class + spid + event_sequence>);

        -- 4. Log success to config.collection_log
        INSERT INTO config.collection_log (...) VALUES (..., 'SUCCESS', @rows_collected, ...);
    END TRY
    BEGIN CATCH
        -- 5. Log failure with error message
        INSERT INTO config.collection_log (...) VALUES (..., 'ERROR', 0, @error_message);
        THROW;
    END CATCH;
END;
```

Every collector does the same five things, matching the numbered comments above: **validate, detect first-run, pull from the DMV and insert with dedupe, log success, log failure**. Once you've read one, you've read all thirty. The differences are the source DMV, the filter conditions, and the shape of the destination table.
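Two details in the collector excerpt tend to surprise new readers, and the self-contained sketch below demonstrates both. First, `EXISTS` never evaluates its select list, so `SELECT 1/0` cannot raise a divide-by-zero error. Second, a `RAISERROR` with severity 16 inside `BEGIN TRY` transfers control straight to the `CATCH` block, so the failure is what gets logged. The `@log` table variable is a hypothetical stand-in for `config.collection_log`; only `collector_name` and `collection_status` are column names taken from this doc, and `demo_collector` is a made-up name.

```sql
/* 1. EXISTS only checks for rows; the 1/0 in the select list is never evaluated */
IF EXISTS (SELECT 1/0 FROM sys.databases)
BEGIN
    PRINT N'EXISTS returned true without a divide-by-zero error';
END;

/* 2. Hypothetical stand-in for config.collection_log */
DECLARE @log table
(
    collector_name sysname NOT NULL,
    collection_status nvarchar(20) NOT NULL,
    error_message nvarchar(4000) NULL
);

DECLARE @hours_back integer = 0; /* deliberately invalid */

BEGIN TRY
    IF @hours_back <= 0 OR @hours_back > 168
    BEGIN
        /* Severity 16 inside TRY jumps to CATCH; nothing after this line in the TRY block runs */
        RAISERROR(N'@hours_back must be between 1 and 168 hours', 16, 1);
    END;

    INSERT @log (collector_name, collection_status, error_message)
    VALUES (N'demo_collector', N'SUCCESS', NULL);
END TRY
BEGIN CATCH
    INSERT @log (collector_name, collection_status, error_message)
    VALUES (N'demo_collector', N'ERROR', ERROR_MESSAGE());
    /* The real collector re-raises with THROW here; omitted so the SELECT below still runs */
END CATCH;

/* One row, with collection_status = ERROR and the validation message */
SELECT
    l.collector_name,
    l.collection_status,
    l.error_message
FROM @log AS l;
```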

### The schedule table

**File**: [`install/03_create_config_tables.sql`](../install/03_create_config_tables.sql) (table definition)

`config.collection_schedule` is the single source of truth for *what runs and when*. It has one row per collector:

| Column | Meaning |
| --- | --- |
| `collector_name` | The name the dispatcher's `IF/ELSE` block matches on |
| `enabled` | Bit flag; off means the dispatcher skips this row entirely |
| `frequency_minutes` | How often to run. `0` means "on connect / daily / special" (see below) |
| `last_run_time` | When the collector last started; updated by the dispatcher |
| `next_run_time` | When the collector is next eligible: `last_run_time + frequency_minutes` |
| `max_duration_minutes` | Kill switch for the hung-job monitor |
| `retention_days` | How long to keep data in the target `collect.*` table |

You can edit any column of this table directly, but **don't**, with one exception. The supported knobs are:

- **`config.apply_collection_preset`**: bulk-sets `frequency_minutes` for all collectors at once (presets: `Aggressive`, `Balanced`, `Low-Impact`).
- **Individual `UPDATE` statements on `enabled`**: turn specific collectors on or off. This is the one direct edit that is supported.

**File**: [`install/41_schedule_management.sql`](../install/41_schedule_management.sql) has the preset procedure and some helper procs for listing / resetting the schedule.
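As a sketch of the supported direct edit, the statements below review the schedule and then switch one collector off. They assume an installed Full Edition, that you are connected to the `PerformanceMonitor` database, and that the columns match the table above. The preset procedure is not shown because its parameter names are not given in this doc.

```sql
/* Review what is scheduled and when each collector is next eligible */
SELECT
    cs.collector_name,
    cs.enabled,
    cs.frequency_minutes,
    cs.last_run_time,
    cs.next_run_time,
    cs.max_duration_minutes,
    cs.retention_days
FROM config.collection_schedule AS cs
ORDER BY
    cs.collector_name;

/* Turn a single collector off; the dispatcher skips disabled rows on its next tick */
UPDATE cs
SET
    enabled = 0
FROM config.collection_schedule AS cs
WHERE cs.collector_name = N'default_trace_collector';
```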

### Where does the data go?

Each collector writes to a table in the `collect` schema: `collect.query_stats`, `collect.default_trace_events`, `collect.wait_stats`, etc. Same shape each time: a `collection_time datetime2` column, plus whatever the DMV gave us, plus whatever we computed.

Some tables use `COMPRESS()` on large text/XML columns (query text, plan XML). The values are stored as `varbinary(max)` and wrapped in `DECOMPRESS()` on read. That's why query text looks like gibberish if you `SELECT * FROM collect.query_stats` directly. Read through `v_query_stats` instead, which handles the decompression.
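The round trip is easy to see in isolation. This sketch uses only local variables (the sample query text is made up) and needs SQL Server 2016 or later, where `COMPRESS()` and `DECOMPRESS()` were introduced. The cast after `DECOMPRESS()` has to match the type that was compressed, `nvarchar(max)` here, or the text comes back garbled.

```sql
DECLARE @query_text nvarchar(max) = N'SELECT d.name FROM sys.databases AS d;';

/* COMPRESS returns GZIP-compressed bytes as varbinary(max) */
DECLARE @stored varbinary(max) = COMPRESS(@query_text);

SELECT
    stored_bytes = @stored,                                     /* what a raw SELECT * shows */
    query_text = CONVERT(nvarchar(max), DECOMPRESS(@stored));   /* what a decompressing view returns */
```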

### The Dashboard read path

The Dashboard is a WPF app. It connects to the `PerformanceMonitor` database and issues SELECT queries. No collection happens in the app; the Dashboard is purely a reader. Every time you pick a time range, change a tab, or hit refresh, the app runs a SQL query against `collect.*` tables or `v_*` views, pulls rows into a `List<T>`, and binds that list to a WPF DataGrid or a ScottPlot chart.

The query layer lives in `Dashboard/Services/DatabaseService.*.cs`, split by concern (`DatabaseService.QueryPerformance.cs`, `DatabaseService.SystemEvents.cs`, etc.). Each file is just SQL in C# strings. If the Dashboard is showing you something, there's a method somewhere in that folder returning it.

### Retention

**File**: [`install/45_create_agent_jobs.sql`](../install/45_create_agent_jobs.sql) (job definition) and wherever `config.data_retention` lives.

Once a day, the `PerformanceMonitor - Data Retention` job runs a `DELETE` loop per `collect.*` table, respecting each row's `retention_days` from `config.collection_schedule`. These are targeted, batched deletes, not a truncate: history older than the retention window disappears; recent data is untouched.
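A batched retention delete generally looks like the sketch below. It is not the body of `config.data_retention`; the temp table, the batch size, and the sample rows are assumptions chosen so the loop visibly runs more than once. Deleting in small batches keeps each transaction short, which is the usual reason to prefer a loop over one large `DELETE`.

```sql
/* Hypothetical stand-in for a collect.* table */
CREATE TABLE #demo_events
(
    event_id integer IDENTITY PRIMARY KEY,
    collection_time datetime2(7) NOT NULL
);

/* Six rows older than 30 days, three recent rows */
INSERT #demo_events (collection_time)
SELECT
    DATEADD(DAY, -v.days_ago, SYSDATETIME())
FROM
(
    VALUES (45), (44), (43), (42), (41), (40), (3), (2), (1)
) AS v (days_ago);

DECLARE
    @retention_days integer = 30, /* would come from config.collection_schedule */
    @batch_size integer = 4,      /* tiny on purpose, so the loop iterates */
    @rows_deleted integer = 1;

DECLARE @cutoff datetime2(7) = DATEADD(DAY, -@retention_days, SYSDATETIME());

/* Delete expired rows a batch at a time until nothing older than the cutoff is left */
WHILE @rows_deleted > 0
BEGIN
    DELETE TOP (@batch_size)
    FROM #demo_events
    WHERE collection_time < @cutoff;

    SET @rows_deleted = @@ROWCOUNT;
END;

/* Only the three recent rows remain */
SELECT
    remaining_rows = COUNT_BIG(*)
FROM #demo_events;

DROP TABLE #demo_events;
```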

---

## Lite Edition

### What's different

Lite is a standalone WPF app: **no SQL Agent involved, no PerformanceMonitor database**. The app itself is the collector, and the storage is a local DuckDB file (`%LocalAppData%\PerformanceMonitorLite\pm_lite.duckdb`).

The shape still mirrors Full: a dispatcher picks collectors, each collector pulls from DMVs and writes to a destination table, and a reader service hands data to the UI.

### The two services

**Writer**: [`Lite/Services/RemoteCollectorService.cs`](../Lite/Services/RemoteCollectorService.cs) plus one `RemoteCollectorService.<Name>.cs` partial per collector (19 of them). The service opens a `SqlConnection` to the monitored server, runs DMV queries, and bulk-inserts results into DuckDB.

**Reader**: [`Lite/Services/LocalDataService.*.cs`](../Lite/Services/), which queries DuckDB and returns results to the UI.

Only one connection writes at a time. DuckDB is single-writer, so within a given server the collectors run **sequentially** (not in parallel). Multi-server parallelism still works: each monitored server runs its own serialized collector chain.

### The schedule

**File**: [`Lite/config/collection_schedule.json`](../Lite/config/collection_schedule.json)

A JSON file, not a table. User-editable. The Lite app reads it at startup and at each wake-up tick. Same shape as the Full Edition schedule (name, enabled, frequency_minutes, retention_days) with one convention: `frequency_minutes: 0` means "run once at connect time". It is used for server config, database config, trace flags, etc. that don't change between restarts.

### Data retention

Lite runs retention inline as part of each collection cycle, with no separate job. Each collector checks its `retention_days` against the max timestamp in its target table and deletes older rows. DuckDB checkpoints after each cycle to flush the WAL.

---

## Where to look next

If you want to **understand a specific feature**, find the code from the UI outward:
1. Find the grid/chart in the app.
2. Find its XAML file (`Dashboard/*.xaml` or `Lite/Controls/*.xaml`).
3. Follow the `Click` handler or `ItemsSource` binding to the `*.xaml.cs` file.
4. Follow the service call (`_databaseService.GetXxxAsync(...)` in Full, `LocalDataService.GetXxxAsync(...)` in Lite) to the query.

If you want to **understand a specific collector**, read:
1. `install/NN_collect_<name>.sql` for Full Edition, or
2. `Lite/Services/RemoteCollectorService.<Name>.cs` for Lite.

If you want to **add a collector or a new data source**, the dispatcher file in Full (`42_scheduled_master_collector.sql`) or `RemoteCollectorService.cs` in Lite is where you wire it up. Those are the files that know about every collector.

If something feels genuinely undocumented rather than "read the code," open an issue. Gaps get prioritized based on what comes up.
