feat(metrics): replace placeholder bytesScanned with real DuckDB profile #25

@gordonmurray

Description

run_query currently returns bytesScanned = len(str(rows)) * 2, a string-length heuristic computed from the returned rows rather than the bytes actually read from S3. This undermines the "see what this query costs" value prop.

Fix

Run queries under DuckDB's JSON profiler and pull real numbers from it.

  • Before executing, enable profiling on the connection and point its output at a tempfile owned by that connection (e.g. one created with tempfile.NamedTemporaryFile), so concurrent queries don't race on a shared filename (see #9, "Global DuckDB connection torn down and rebuilt on every request").
  • After the query completes, parse the JSON profile and extract:
    • bytes_scanned — sum of bytes read from S3-backed scans
    • rows_scanned — pre-filter row count (distinct from rowsReturned)
    • operator-level timings if useful for UI later
  • QueryStats gains rowsScanned. bytesScanned stops being a guess.
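
The steps above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the intended implementation: the helper names (enable_json_profiling, sum_metric, read_profile_stats) are hypothetical, and the JSON key names are passed in as parameters because the profile layout differs between DuckDB versions and should be checked against the version in use. It also sums every node that carries the key, so restricting the total to S3-backed scans is left out.

```python
import json
import os
import tempfile


def enable_json_profiling(con) -> str:
    """Point this connection's profiler at a fresh temp file; return its path.

    `con` is assumed to be a DuckDB connection (anything with .execute()).
    """
    fd, path = tempfile.mkstemp(prefix="duckdb_profile_", suffix=".json")
    os.close(fd)  # DuckDB writes the file itself; we only need a unique name
    quoted = path.replace("'", "''")  # escape single quotes for the SQL literal
    con.execute("PRAGMA enable_profiling = 'json'")
    con.execute(f"PRAGMA profiling_output = '{quoted}'")
    return path


def sum_metric(node: dict, key: str):
    """Recursively sum a numeric field over a profile node and its children."""
    value = node.get(key)
    total = value if isinstance(value, (int, float)) else 0
    for child in node.get("children", []):
        total += sum_metric(child, key)
    return total


def read_profile_stats(path: str, bytes_key: str, rows_key: str) -> dict:
    """Parse the JSON profile, then delete the temp file.

    bytes_key and rows_key are assumptions supplied by the caller.
    """
    with open(path, encoding="utf-8") as fh:
        profile = json.load(fh)
    os.remove(path)
    return {
        "bytesScanned": sum_metric(profile, bytes_key),
        "rowsScanned": sum_metric(profile, rows_key),
    }
```

A caller would invoke enable_json_profiling before the query and read_profile_stats only after the result has been fully fetched, since the profile file is written when the query finishes.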

Driven by

backend/tests/test_metrics.py::test_query_metrics_calls_profiling is already red, asserting that run_query issues PRAGMA enable_profiling. Strengthen test_bytes_scanned_is_not_dummy once the real metric is plumbed so it actually asserts the value isn't len(str(rows)) * 2.
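
One possible shape for the strengthened assertion is sketched below. The helper names are hypothetical, and the real test would take rows and bytesScanned from run_query rather than from literals. A fixture where the two values cannot plausibly coincide helps here, for example an aggregate that returns a single row while scanning a large table.

```python
def placeholder_bytes(rows) -> int:
    """The old heuristic quoted in this issue: len(str(rows)) * 2."""
    return len(str(rows)) * 2


def check_bytes_scanned_is_not_dummy(rows, bytes_scanned: int) -> None:
    """Fail if the reported metric still equals the string-length guess."""
    dummy = placeholder_bytes(rows)
    assert bytes_scanned != dummy, (
        f"bytesScanned={bytes_scanned} matches the placeholder heuristic ({dummy})"
    )


# Illustrative values only: a count over a large table returns one small row,
# so the heuristic yields a tiny number regardless of how much data was read.
rows = [(1000000,)]
print(placeholder_bytes(rows))  # 24
```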

Why this matters

Closes the "metrics are misleading" critical issue from the Oct 2025 MVP review (see CLAUDE.md). Enables any future billing model that isn't pure compute-time.

Metadata

    Labels

    correctness (Produces wrong results or unsafe behaviour), enhancement (New feature or request)
