
Commit 0597991

DuckDB backend for benchmarks website (#7491)
## Summary

This PR started as a joke with @joseph-isaacs, but I actually think the result is quite nice (even considering the diff size). It changes the benchmarks website backend from a specialized ETL in JS to processing the raw JSON into DuckDB tables and serving all endpoints with SQL.

### Tooling

Moving to SQL also makes it easier to explore the data locally, and this change includes tooling and documentation for getting the data into a local DuckDB instance, including generating all required SQL statements and downloading the data.

### Docs

This PR includes new docs about the benchmarks website:

- README.md covers general setup, development, and how to explore the data locally.
- SCHEMA.md describes the various tables and views, and how the API uses them.
- ETL.md describes the full lifecycle of the data: how it is generated by the post-merge CI run, what the various fields actually look like, and the processing steps the data goes through before reaching the API.

---------

Signed-off-by: Adam Gutglick <adam@spiraldb.com>
1 parent 50531ab · commit 0597991

22 files changed · 2,820 additions and 734 deletions

benchmarks-website/Dockerfile

Lines changed: 3 additions & 2 deletions
```diff
@@ -1,16 +1,17 @@
-FROM node:24-alpine AS build
+FROM node:24-bookworm-slim AS build
 WORKDIR /app
 COPY package.json package-lock.json ./
 RUN npm ci
 COPY . .
 RUN npm run build

-FROM node:24-alpine
+FROM node:24-bookworm-slim
 WORKDIR /app
 COPY package.json package-lock.json ./
 RUN npm ci --omit=dev
 COPY --from=build /app/dist ./dist
 COPY server.js .
+COPY store ./store
 COPY src/config.js ./src/config.js
 EXPOSE 3000
 CMD ["node", "server.js"]
```
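
To try the image locally, the standard Docker workflow applies (the tag name here is illustrative; the build context must be the `benchmarks-website` directory so `COPY . .` picks up the app):

```bash
# Build from the benchmarks-website directory and run the exposed port.
docker build -t benchmarks-website ./benchmarks-website
docker run --rm -p 3000:3000 benchmarks-website
```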

benchmarks-website/ETL.md

Lines changed: 302 additions & 0 deletions
Large diffs are not rendered by default.

benchmarks-website/README.md

Lines changed: 206 additions & 0 deletions
# Benchmarks Website

This directory contains the benchmark website frontend, the Node HTTP server, and the DuckDB-based
refresh pipeline that turns the raw benchmark artifacts into chartable time series.

For the data model and table relationships, start with [SCHEMA.md](./SCHEMA.md).
For the upstream artifact generation and refresh/materialization flow, see [ETL.md](./ETL.md).

## Prerequisites

- Node.js `>=18`
- npm
- Optional: DuckDB CLI, if you want to query the cached artifacts directly

Install dependencies:

```bash
cd benchmarks-website
npm install
```

## Development Server

Run the frontend and backend together:

```bash
cd benchmarks-website
npm run dev
```

That starts:

- Vite on `http://localhost:5173`
- the API/static server on `http://localhost:3000`

Useful endpoints:

- `http://localhost:3000/api/metadata`
- `http://localhost:3000/api/health`
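
A quick smoke test against a running dev server (a sketch; `jq` is optional and used only for pretty-printing):

```bash
# Confirm the server is up and the initial refresh has completed.
curl -s http://localhost:3000/api/health

# Fetch the metadata the frontend uses to build its charts.
curl -s http://localhost:3000/api/metadata | jq .
```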

The backend refreshes from these artifact URLs by default:

- `https://vortex-ci-benchmark-results.s3.amazonaws.com/data.json.gz`
- `https://vortex-ci-benchmark-results.s3.amazonaws.com/commits.json`

Relevant environment variables:

```bash
PORT=3000
REFRESH_INTERVAL=300000
DATA_URL=https://vortex-ci-benchmark-results.s3.amazonaws.com/data.json.gz
COMMITS_URL=https://vortex-ci-benchmark-results.s3.amazonaws.com/commits.json
CACHE_DIR=/path/to/local/cache
```

`CACHE_DIR` is the most useful one during development. If it is unset, the server uses a temp
directory under `os.tmpdir()`.
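
For example, a local run with a faster refresh cycle (values illustrative; this assumes the dev script forwards these variables to the server process):

```bash
# Refresh every minute instead of the default five, using a local cache.
REFRESH_INTERVAL=60000 CACHE_DIR="$PWD/.cache/benchmarks" npm run dev
```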

## Pull The Data Locally

If you want a predictable local copy for exploration, populate a cache directory yourself and point
the server at it.

```bash
cd benchmarks-website
mkdir -p .cache/benchmarks

curl -L \
  https://vortex-ci-benchmark-results.s3.amazonaws.com/data.json.gz \
  -o .cache/benchmarks/data.json.gz

curl -L \
  https://vortex-ci-benchmark-results.s3.amazonaws.com/commits.json \
  -o .cache/benchmarks/commits.json
```

Then start the dev server against that cache:

```bash
cd benchmarks-website
CACHE_DIR="$PWD/.cache/benchmarks" npm run dev
```

On first startup, the server will use the cached files immediately and then asynchronously
revalidate them against S3.
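
To sanity-check the cached artifacts (both are newline-delimited JSON), standard tools suffice:

```bash
# Verify the gzip stream is intact, then peek at one record of each artifact.
gzip -t .cache/benchmarks/data.json.gz
gzip -cd .cache/benchmarks/data.json.gz | head -n 1
head -n 1 .cache/benchmarks/commits.json
```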

## Explore The Cached Data Directly

Once `data.json.gz` and `commits.json` exist locally, you can query them with DuckDB without
running the website.

Example with the DuckDB CLI:

```sql
create view raw_commits as
select *
from read_json(
    '.cache/benchmarks/commits.json',
    format = 'newline_delimited',
    compression = 'auto_detect',
    columns = {
        id: 'VARCHAR',
        message: 'VARCHAR',
        timestamp: 'VARCHAR',
        author: 'JSON',
        url: 'VARCHAR'
    }
);

create view raw_benchmarks as
select *
from read_json(
    '.cache/benchmarks/data.json.gz',
    format = 'newline_delimited',
    compression = 'auto_detect',
    columns = {
        name: 'VARCHAR',
        unit: 'VARCHAR',
        value: 'DOUBLE',
        storage: 'VARCHAR',
        dataset: 'JSON',
        commit: 'JSON',
        commit_id: 'VARCHAR'
    }
);
```
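
With both views in place, you can join benchmark rows to their commit metadata. A sketch using only the columns declared above (`timestamp` is stored as `VARCHAR`, so the ordering assumes ISO-8601 strings):

```sql
-- Latest benchmark rows with their commit metadata attached.
select
    c."timestamp",
    c.id as commit_id,
    b.name,
    b.value,
    b.unit
from raw_benchmarks b
join raw_commits c
    on coalesce(json_extract_string(b.commit, '$.id'), b.commit_id) = c.id
order by c."timestamp" desc
limit 10;
```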

Useful starter queries:

```sql
select count(*) as commit_count from raw_commits;

select count(*) as benchmark_count from raw_benchmarks;

select split_part(name, '/', 1) as prefix, count(*) as rows
from raw_benchmarks
group by 1
order by 2 desc
limit 20;

select
    coalesce(json_extract_string(commit, '$.id'), commit_id) as resolved_commit_id,
    count(*) as rows
from raw_benchmarks
group by 1
order by 2 desc
limit 20;
```

If you want the normalized relational model rather than the raw JSON views, follow the pipeline in
[SCHEMA.md](./SCHEMA.md) and [`store/sql.js`](./store/sql.js).

## Export The Full Bootstrap SQL

If you want the exact SQL that the server uses to create all config tables, raw views, normalized
tables, and derived projections, export it from the shared SQL builder:

```bash
cd benchmarks-website
npm run export-sql -- \
  --data-path "$PWD/.cache/benchmarks/data.json.gz" \
  --commits-path "$PWD/.cache/benchmarks/commits.json" \
  --output "$PWD/.cache/benchmarks/bootstrap.sql"
```

Then load it in DuckDB:

```bash
duckdb benchmark-explore.duckdb < .cache/benchmarks/bootstrap.sql
```

That creates the same tables and views the server uses, including:

- `query_suites`
- `valid_groups`
- `engine_renames`
- `raw_commits`
- `raw_benchmarks`
- `commit_dim`
- `benchmarks_base`
- `matched_suites`
- `classified_benchmarks`
- `benchmark_points`
- `active_commits`
- `benchmark_points_active`
- `chart_defs`
- `chart_latest_idx`
- `chart_latest_values`
- `chart_series_latest_values`
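
Once loaded, you can inspect the result straight from the CLI (a minimal sketch; `benchmark_points` is picked arbitrarily from the list above):

```bash
# List what the bootstrap script created in the database file.
duckdb benchmark-explore.duckdb -c "show tables;"

# Peek at one of the derived objects.
duckdb benchmark-explore.duckdb -c "select * from benchmark_points limit 5;"
```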

If you want a portable template instead of path-specific SQL:

```bash
cd benchmarks-website
npm run export-sql -- --placeholders --output bootstrap.template.sql
```

That emits a script using `__DATA_PATH__` and `__COMMITS_PATH__` placeholders.
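
To turn the template into runnable SQL, substitute the placeholders with concrete paths, e.g. with `sed` (paths illustrative):

```bash
# Fill in the placeholders, then load the result as before.
sed -e "s|__DATA_PATH__|$PWD/.cache/benchmarks/data.json.gz|g" \
    -e "s|__COMMITS_PATH__|$PWD/.cache/benchmarks/commits.json|g" \
    bootstrap.template.sql > bootstrap.local.sql

duckdb benchmark-explore.duckdb < bootstrap.local.sql
```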

## Notes

- The website only projects the subset of the raw benchmark JSON it needs for grouping, charting,
  and summaries.
- Benchmark names are part of the schema. Group, chart, and series identity are inferred from the
  `name`, `storage`, and `dataset` fields during refresh.
- The server returns `503` with `Retry-After` while the initial refresh is still loading (see the
  probe sketch below).
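
You can watch for readiness during that window (a sketch; `-i` prints the status line and response headers, including `Retry-After` while the refresh is in flight):

```bash
# Returns 503 with a Retry-After header until the first refresh finishes.
curl -si http://localhost:3000/api/health | head -n 5
```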
