feat(supervisor): publish client-side dequeue API latency as a Prometheus histogram #3887

Merged: myftija merged 5 commits into main from supervisor-dequeue-latency-metric on Jun 10, 2026

@myftija myftija commented Jun 10, 2026

Collaborator

The supervisor's dequeue round-trip time (POST /engine/v1/worker-actions/dequeue) was measured but only flowed into wide events and OTel span attributes — there was no Prometheus series, so latency percentiles and error rates weren't queryable. This adds queue_consumer_pool_dequeue_duration_seconds (histogram, label outcome=success|empty|error) to the existing consumer-pool metrics, scraped automatically by the existing ServiceMonitors on queue-raider/schedule-raider/supervisor.

  • Records every dequeue call, including failed ones, which previously emitted no timing at all
  • The pool's shared ConsumerPoolMetrics instance is injected into each consumer (mirrors the BackpressureMetrics → BackpressureMonitor wiring)
  • Buckets extend to 30s because wrapZodFetch retries internally (5 attempts, ≥7.5s backoff before a retryable error surfaces)
  • Existing dequeueResponseMs wide-event/span behavior unchanged
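
To make the recording path in the list above concrete, here is a minimal TypeScript sketch of timing one logical dequeue call and labelling the observation by outcome. It is not the code from this PR: `DequeueResult`, `classifyOutcome`, `timedDequeue`, and the injectable clock are hypothetical names, and the `observeDequeueLatency(outcome, seconds)` signature is an assumption, since the PR only names the method.

```typescript
// Outcome label values taken from the PR description.
type DequeueOutcome = "success" | "empty" | "error";

// Hypothetical recorder interface; the real ConsumerPoolMetrics API may differ.
interface DequeueLatencyRecorder {
  observeDequeueLatency(outcome: DequeueOutcome, seconds: number): void;
}

// Hypothetical result shape for a client that reports failures as values.
type DequeueResult<T> =
  | { status: "ok"; messages: T[] }
  | { status: "failed"; error: string };

// Maps a result to the outcome label used on the histogram.
function classifyOutcome<T>(result: DequeueResult<T>): DequeueOutcome {
  if (result.status === "failed") return "error";
  return result.messages.length > 0 ? "success" : "empty";
}

// Times one logical dequeue call (internal retries included) and records
// exactly one observation, whatever the outcome.
async function timedDequeue<T>(
  dequeue: () => Promise<DequeueResult<T>>,
  metrics: DequeueLatencyRecorder | undefined,
  now: () => number = () => Date.now()
): Promise<T[]> {
  const start = now();
  let outcome: DequeueOutcome = "error";
  try {
    const result = await dequeue();
    outcome = classifyOutcome(result);
    return result.status === "ok" ? result.messages : [];
  } catch {
    // Defensive branch: the outcome stays "error" and no messages are returned.
    return [];
  } finally {
    // The clock returns milliseconds; the histogram is in seconds.
    metrics?.observeDequeueLatency(outcome, (now() - start) / 1000);
  }
}
```

Recording in `finally` is one way to guarantee that failed calls emit a timing too, which is the gap the first bullet describes.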

myftija added 2 commits June 10, 2026 13:40
…heus histogram

The dequeue round-trip time was only visible in wide events and span
attributes, so there was no way to query latency percentiles or error
rates. Record it as queue_consumer_pool_dequeue_duration_seconds with
an outcome label (success/empty/error), covering failed and timed-out
calls that previously emitted no timing at all. The pool's shared
ConsumerPoolMetrics instance is injected into each consumer, mirroring
how BackpressureMetrics is wired into BackpressureMonitor.
…review fixes

The HTTP client retries internally (5 attempts, >=7.5s of backoff before
a retryable error surfaces), so the 5s bucket ceiling would have pushed
nearly every retried error into +Inf. Extend buckets to 30s and state
in the help text that one observation spans the whole logical call
including retries. Also: stop clobbering a caller-supplied consumer
metrics instance, correct the catch-branch comment (defensive only -
wrapZodFetch never throws), and cover the pool-to-consumer metrics
injection with tests.
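
The bucket-ceiling problem described in this commit can be illustrated with a small TypeScript sketch. The bucket bounds below are illustrative assumptions: the commit states the old ceiling (5s) and the new one (30s) but does not list the full set, and `firstBucket` is a hypothetical helper rather than part of any metrics library.

```typescript
// Returns the label of the first cumulative bucket whose upper bound covers
// the observation, or "+Inf" when it exceeds every finite bound.
function firstBucket(upperBounds: number[], seconds: number): string {
  const sorted = [...upperBounds].sort((a, b) => a - b);
  const bound = sorted.find((le) => seconds <= le);
  return bound === undefined ? "+Inf" : String(bound);
}

// Illustrative bounds only (assumed, not taken from the PR).
const oldBounds = [0.1, 0.5, 1, 5];
const newBounds = [0.1, 0.5, 1, 5, 10, 30];

// A retried error surfaces after at least 7.5s of backoff.
const retriedErrorSeconds = 7.5;

console.log(firstBucket(oldBounds, retriedErrorSeconds)); // "+Inf"
console.log(firstBucket(newBounds, retriedErrorSeconds)); // "10"
```

Once an observation falls only into +Inf, the histogram can no longer say how far above the last finite bound it was, which is why the ceiling has to sit above the longest expected logical call.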

changeset-bot Bot commented Jun 10, 2026

🦋 Changeset detected

Latest commit: 6fe5dda

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 25 packages
Name Type
@trigger.dev/core Patch
@trigger.dev/build Patch
trigger.dev Patch
@trigger.dev/plugins Patch
@trigger.dev/python Patch
@trigger.dev/redis-worker Patch
@trigger.dev/schema-to-json Patch
@trigger.dev/sdk Patch
@internal/cache Patch
@internal/clickhouse Patch
@internal/llm-model-catalog Patch
@trigger.dev/rbac Patch
@internal/redis Patch
@internal/replication Patch
@internal/run-engine Patch
@internal/schedule-engine Patch
@internal/testcontainers Patch
@internal/tracing Patch
@internal/tsql Patch
@internal/zod-worker Patch
@internal/sdk-compat-tests Patch
@trigger.dev/react-hooks Patch
@trigger.dev/rsc Patch
@trigger.dev/database Patch
@trigger.dev/otlp-importer Patch

coderabbitai Bot commented Jun 10, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 5741ce28-0b93-4d91-b855-f9d236c3039b

📥 Commits

Reviewing files that changed from the base of the PR and between 16b693c and 6fe5dda.

📒 Files selected for processing (1)
  • packages/core/src/v3/runEngineWorker/supervisor/consumerPoolMetrics.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • packages/core/src/v3/runEngineWorker/supervisor/consumerPoolMetrics.ts
📜 Recent review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (40)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (7, 12)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (10, 10)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (2, 10)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (2, 12)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (3, 12)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (5, 10)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (7, 10)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (4, 10)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (8, 12)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (11, 12)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (8, 10)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (5, 12)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (6, 10)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (3, 10)
  • GitHub Check: sdk-compat / Node.js 22.12 (ubuntu-latest)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (1, 12)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (9, 12)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (12, 12)
  • GitHub Check: sdk-compat / Cloudflare Workers
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (1, 10)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (6, 12)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (4, 12)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (10, 12)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (9, 10)
  • GitHub Check: typecheck / typecheck
  • GitHub Check: sdk-compat / Node.js 20.20 (ubuntu-latest)
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
  • GitHub Check: packages / 🧪 Unit Tests: Packages (3, 3)
  • GitHub Check: sdk-compat / Deno Runtime
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - npm)
  • GitHub Check: sdk-compat / Bun Runtime
  • GitHub Check: e2e-webapp / 🧪 E2E Tests: Webapp
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
  • GitHub Check: packages / 🧪 Unit Tests: Packages (2, 3)
  • GitHub Check: packages / 🧪 Unit Tests: Packages (1, 3)
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
  • GitHub Check: Build and publish previews
  • GitHub Check: audit
  • GitHub Check: audit
  • GitHub Check: Analyze (javascript-typescript)

Walkthrough

This PR adds a Prometheus histogram for client-side dequeue round-trip latency (labelled by DequeueOutcome: "success" | "empty" | "error"), exposes observeDequeueLatency on ConsumerPoolMetrics, wires the pool’s shared metrics instance into created consumers (with a caller-metrics fallback), measures and records latency in RunQueueConsumer.dequeue() for success/empty/error paths, and adds tests verifying metrics wiring and correct outcome-labeled observations.
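
The wiring described in the walkthrough can be sketched as follows. This is a hedged illustration, not the PR's code: `ConsumerMetrics`, `ConsumerOptions`, `Consumer`, and `ConsumerPool` are simplified hypothetical types, and letting a caller-supplied instance take precedence over the pool's shared one is a reading of the "stop clobbering a caller-supplied consumer metrics instance" commit note.

```typescript
// Hypothetical minimal metrics surface.
interface ConsumerMetrics {
  observeDequeueLatency(outcome: string, seconds: number): void;
}

// Hypothetical consumer options; `metrics` is optional for callers.
interface ConsumerOptions {
  name: string;
  metrics?: ConsumerMetrics;
}

class Consumer {
  readonly metrics: ConsumerMetrics | undefined;

  constructor(options: ConsumerOptions) {
    this.metrics = options.metrics;
  }
}

class ConsumerPool {
  private readonly poolMetrics: ConsumerMetrics;

  constructor(poolMetrics: ConsumerMetrics) {
    this.poolMetrics = poolMetrics;
  }

  // Injects the shared instance unless the caller already supplied one.
  createConsumer(options: ConsumerOptions): Consumer {
    return new Consumer({
      ...options,
      metrics: options.metrics ?? this.poolMetrics,
    });
  }
}

const sharedMetrics: ConsumerMetrics = { observeDequeueLatency() {} };
const callerMetrics: ConsumerMetrics = { observeDequeueLatency() {} };
const pool = new ConsumerPool(sharedMetrics);

console.log(pool.createConsumer({ name: "a" }).metrics === sharedMetrics); // true
console.log(pool.createConsumer({ name: "b", metrics: callerMetrics }).metrics === callerMetrics); // true
```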

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Description check | ❓ Inconclusive | The description provides comprehensive context: what changed (Prometheus histogram for dequeue latency), why (enable queryable latency metrics), technical details (buckets, outcome labels, injection pattern), and impact (all dequeue calls now measured). However, it does not follow the provided template structure with explicit checklist items or testing/changelog sections. | Consider restructuring the description to match the repository template: add the checklist with confirmations, separate Testing and Changelog sections, and clarify test coverage for the new metrics. |
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and specifically describes the main change: adding a Prometheus histogram metric to publish dequeue API latency in the supervisor, which matches the core objective of the changeset. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@devin-ai-integration devin-ai-integration Bot left a comment

Contributor

Devin Review found 0 potential issues.

View 3 additional findings in Devin Review.

@devin-ai-integration devin-ai-integration Bot left a comment

Contributor

Devin Review found 0 new potential issues.

View 5 additional findings in Devin Review.

myftija added 2 commits June 10, 2026 15:57
…g-poll boundary

The server parks empty dequeues on a ~10s blocking pop, so nearly all
observations land just above 10s. With only a 10s and a 30s bucket,
histogram_quantile interpolated p95/p99 to ~28-30s while the true
latency was ~10-11s. Add 11/12.5/15/20s buckets so quantiles read
accurately where the distribution actually sits.
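
The skew described in this commit follows from how histogram_quantile estimates a quantile: it interpolates linearly inside the bucket that contains the requested rank. The TypeScript sketch below re-implements that interpolation in simplified form to reproduce the numbers. `toCumulativeBuckets` and `histogramQuantile` are hypothetical helpers, the 1s and 5s bounds are assumed for illustration, and the sample data (100 observations at 10.5s) is made up to mimic a long-poll that returns just after 10s.

```typescript
// One cumulative bucket: `count` observations had a value <= `le`.
interface Bucket {
  le: number;
  count: number;
}

// Builds cumulative buckets (plus +Inf) from raw observations.
function toCumulativeBuckets(bounds: number[], observations: number[]): Bucket[] {
  const sorted = [...bounds].sort((a, b) => a - b);
  const buckets: Bucket[] = sorted.map((le) => ({
    le,
    count: observations.filter((v) => v <= le).length,
  }));
  buckets.push({ le: Infinity, count: observations.length });
  return buckets;
}

// Simplified quantile estimate for 0 < q <= 1 using linear interpolation
// within the bucket that contains the requested rank.
function histogramQuantile(q: number, buckets: Bucket[]): number {
  if (buckets.length < 2) return NaN;
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  const i = buckets.findIndex((b) => b.count >= rank);
  // Rank falls in +Inf: report the highest finite bound.
  if (i === buckets.length - 1) return buckets[buckets.length - 2].le;
  const lower = i === 0 ? 0 : buckets[i - 1].le;
  const below = i === 0 ? 0 : buckets[i - 1].count;
  const inBucket = buckets[i].count - below;
  return lower + (buckets[i].le - lower) * ((rank - below) / inBucket);
}

// Made-up sample: every empty dequeue returns at 10.5s.
const observations = Array.from({ length: 100 }, () => 10.5);

const coarse = toCumulativeBuckets([1, 5, 10, 30], observations);
const fine = toCumulativeBuckets([1, 5, 10, 11, 12.5, 15, 20, 30], observations);

console.log(histogramQuantile(0.95, coarse)); // about 29
console.log(histogramQuantile(0.95, fine)); // about 10.95
```

With only 10s and 30s bounds, all 100 observations sit in a 20s-wide bucket, so p95 is placed 95% of the way across it. Adding the 11s bound shrinks that bucket to 1s and the estimate lands near the true value.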

pkg-pr-new Bot commented Jun 10, 2026

@trigger.dev/build

npm i https://pkg.pr.new/@trigger.dev/build@6fe5dda

trigger.dev

npm i https://pkg.pr.new/trigger.dev@6fe5dda

@trigger.dev/core

npm i https://pkg.pr.new/@trigger.dev/core@6fe5dda

@trigger.dev/plugins

npm i https://pkg.pr.new/@trigger.dev/plugins@6fe5dda

@trigger.dev/python

npm i https://pkg.pr.new/@trigger.dev/python@6fe5dda

@trigger.dev/react-hooks

npm i https://pkg.pr.new/@trigger.dev/react-hooks@6fe5dda

@trigger.dev/redis-worker

npm i https://pkg.pr.new/@trigger.dev/redis-worker@6fe5dda

@trigger.dev/rsc

npm i https://pkg.pr.new/@trigger.dev/rsc@6fe5dda

@trigger.dev/schema-to-json

npm i https://pkg.pr.new/@trigger.dev/schema-to-json@6fe5dda

@trigger.dev/sdk

npm i https://pkg.pr.new/@trigger.dev/sdk@6fe5dda

commit: 6fe5dda

@devin-ai-integration devin-ai-integration Bot left a comment

Contributor

Devin Review found 0 new potential issues.

View 6 additional findings in Devin Review.

@myftija myftija merged commit 081b6ba into main Jun 10, 2026
55 checks passed
@myftija myftija deleted the supervisor-dequeue-latency-metric branch June 10, 2026 14:35
ericallam pushed a commit that referenced this pull request Jun 12, 2026
## Summary
7 improvements, 1 bug fix.

## Improvements
- `trigger init` now sets up your AI coding assistant as part of project
setup: pick the MCP server, the agent skills, or both, then scaffold
with the CLI or hand off to your assistant. Adds a new `getting-started`
agent skill that teaches assistants how to bootstrap Trigger.dev
(install the SDK, write `trigger.config.ts`, create a first task, run
`trigger dev`), so the AI-driven setup path works end to end. It ships
in the CLI alongside the existing skills, version-matched to your SDK.
([#3872](#3872))
- `dev` and `deploy` now fail with a clear error when two tasks are
defined with the same id, including across different task types (e.g. a
scheduled task and a regular task sharing an id). Previously the second
definition silently overwrote the first, so one of the tasks would
vanish with no warning. Task ids are detected as duplicates during
indexing (naming each offending id and the files it was found in), and
the same rule is enforced server-side when the background worker is
registered.
([#3865](#3865))
- `trigger skills` installs Trigger.dev agent skills into your coding
agent so it knows how to write tasks, schedules, realtime, and
chat.agent code. The skills ship with the CLI and are copied into each
tool's native skills directory (Claude Code, Cursor, GitHub Copilot, and
Codex / AGENTS.md), and `trigger dev` offers to install them on first
run. ([#3868](#3868))
- Reliability fixes for `chat.agent`. A user message sent while the
agent is streaming is no longer delivered twice (which could run a
duplicate turn), input appends now carry an idempotency key so a retried
send can't duplicate a message, stopping a generation clears the
streaming state so a page reload doesn't replay the stopped turn, and
runs can now carry the full set of dashboard tags instead of being
silently truncated. `onTurnComplete` now fires on errored turns (with
the thrown error attached) and the failed turn's user message is
persisted so it isn't lost on the next run. Custom agents and manual
`chat.writeTurnComplete` callers now trim the output stream, sending a
custom action no longer leaves a second stream reader running, and a
long-lived `watch` subscription no longer grows its dedupe set without
bound. ([#3891](#3891))
- Continuation chat boots no longer stall for around 10 seconds before
the first turn. The `session.in` resume cursor is now found with a
non-blocking records read instead of draining an SSE long-poll (which
always waited out its full 5 second inactivity window, twice per boot),
the boot reads run concurrently, and chat snapshots carry the cursor so
subsequent boots skip the scan entirely.
([#3907](#3907))
- Record client-side dequeue API latency in the supervisor consumer pool
as a Prometheus histogram
(`queue_consumer_pool_dequeue_duration_seconds`, labelled by `outcome`:
success/empty/error).
([#3887](#3887))
- Add `GetProjectEnvironmentsResponseBody` and `ProjectEnvironment`
schemas for the new `GET /api/v1/projects/{projectRef}/environments`
endpoint, which lists the parent environments (dev, staging, preview,
prod) a personal access token can access for a project. Dev is scoped to
the token owner and branch (preview child) environments are excluded.
([#3880](#3880))

## Bug fixes
- Fix two `chat.createSession()` bugs: stopping a generation no longer
wedges the run (the turn loop raced a `totalUsage` promise that never
settles after a stop-abort), and continuation runs now wait for the next
message instead of invoking the model with an empty prompt.
([#3920](#3920))

<details>
<summary>Raw changeset output</summary>

⚠️⚠️⚠️⚠️⚠️⚠️

`main` is currently in **pre mode** so this branch has prereleases
rather than normal releases. If you want to exit prereleases, run
`changeset pre exit` on `main`.

⚠️⚠️⚠️⚠️⚠️⚠️

# Releases
## @trigger.dev/build@4.5.0-rc.6

### Patch Changes

-   Updated dependencies:
    -   `@trigger.dev/core@4.5.0-rc.6`

## trigger.dev@4.5.0-rc.6

### Patch Changes

- `trigger init` now sets up your AI coding assistant as part of project
setup: pick the MCP server, the agent skills, or both, then scaffold
with the CLI or hand off to your assistant. Adds a new `getting-started`
agent skill that teaches assistants how to bootstrap Trigger.dev
(install the SDK, write `trigger.config.ts`, create a first task, run
`trigger dev`), so the AI-driven setup path works end to end. It ships
in the CLI alongside the existing skills, version-matched to your SDK.
([#3872](#3872))

- `dev` and `deploy` now fail with a clear error when two tasks are
defined with the same id, including across different task types (e.g. a
scheduled task and a regular task sharing an id). Previously the second
definition silently overwrote the first, so one of the tasks would
vanish with no warning. Task ids are detected as duplicates during
indexing (naming each offending id and the files it was found in), and
the same rule is enforced server-side when the background worker is
registered.
([#3865](#3865))

- `trigger skills` installs Trigger.dev agent skills into your coding
agent so it knows how to write tasks, schedules, realtime, and
chat.agent code. The skills ship with the CLI and are copied into each
tool's native skills directory (Claude Code, Cursor, GitHub Copilot, and
Codex / AGENTS.md), and `trigger dev` offers to install them on first
run. ([#3868](#3868))

    ```bash
    trigger skills --target claude-code
    ```

Replaces the previous `install-rules` command, which stays as an alias.

-   Updated dependencies:
    -   `@trigger.dev/core@4.5.0-rc.6`
    -   `@trigger.dev/build@4.5.0-rc.6`
    -   `@trigger.dev/schema-to-json@4.5.0-rc.6`

## @trigger.dev/core@4.5.0-rc.6

### Patch Changes

- Reliability fixes for `chat.agent`. A user message sent while the
agent is streaming is no longer delivered twice (which could run a
duplicate turn), input appends now carry an idempotency key so a retried
send can't duplicate a message, stopping a generation clears the
streaming state so a page reload doesn't replay the stopped turn, and
runs can now carry the full set of dashboard tags instead of being
silently truncated. `onTurnComplete` now fires on errored turns (with
the thrown error attached) and the failed turn's user message is
persisted so it isn't lost on the next run. Custom agents and manual
`chat.writeTurnComplete` callers now trim the output stream, sending a
custom action no longer leaves a second stream reader running, and a
long-lived `watch` subscription no longer grows its dedupe set without
bound. ([#3891](#3891))
- Continuation chat boots no longer stall for around 10 seconds before
the first turn. The `session.in` resume cursor is now found with a
non-blocking records read instead of draining an SSE long-poll (which
always waited out its full 5 second inactivity window, twice per boot),
the boot reads run concurrently, and chat snapshots carry the cursor so
subsequent boots skip the scan entirely.
([#3907](#3907))
- Record client-side dequeue API latency in the supervisor consumer pool
as a Prometheus histogram
(`queue_consumer_pool_dequeue_duration_seconds`, labelled by `outcome`:
success/empty/error).
([#3887](#3887))
- `dev` and `deploy` now fail with a clear error when two tasks are
defined with the same id, including across different task types (e.g. a
scheduled task and a regular task sharing an id). Previously the second
definition silently overwrote the first, so one of the tasks would
vanish with no warning. Task ids are detected as duplicates during
indexing (naming each offending id and the files it was found in), and
the same rule is enforced server-side when the background worker is
registered.
([#3865](#3865))
- Add `GetProjectEnvironmentsResponseBody` and `ProjectEnvironment`
schemas for the new `GET /api/v1/projects/{projectRef}/environments`
endpoint, which lists the parent environments (dev, staging, preview,
prod) a personal access token can access for a project. Dev is scoped to
the token owner and branch (preview child) environments are excluded.
([#3880](#3880))

## @trigger.dev/python@4.5.0-rc.6

### Patch Changes

-   Updated dependencies:
    -   `@trigger.dev/sdk@4.5.0-rc.6`
    -   `@trigger.dev/core@4.5.0-rc.6`
    -   `@trigger.dev/build@4.5.0-rc.6`

## @trigger.dev/react-hooks@4.5.0-rc.6

### Patch Changes

-   Updated dependencies:
    -   `@trigger.dev/core@4.5.0-rc.6`

## @trigger.dev/redis-worker@4.5.0-rc.6

### Patch Changes

-   Updated dependencies:
    -   `@trigger.dev/core@4.5.0-rc.6`

## @trigger.dev/rsc@4.5.0-rc.6

### Patch Changes

-   Updated dependencies:
    -   `@trigger.dev/core@4.5.0-rc.6`

## @trigger.dev/schema-to-json@4.5.0-rc.6

### Patch Changes

-   Updated dependencies:
    -   `@trigger.dev/core@4.5.0-rc.6`

## @trigger.dev/sdk@4.5.0-rc.6

### Patch Changes

- Reliability fixes for `chat.agent`. A user message sent while the
agent is streaming is no longer delivered twice (which could run a
duplicate turn), input appends now carry an idempotency key so a retried
send can't duplicate a message, stopping a generation clears the
streaming state so a page reload doesn't replay the stopped turn, and
runs can now carry the full set of dashboard tags instead of being
silently truncated. `onTurnComplete` now fires on errored turns (with
the thrown error attached) and the failed turn's user message is
persisted so it isn't lost on the next run. Custom agents and manual
`chat.writeTurnComplete` callers now trim the output stream, sending a
custom action no longer leaves a second stream reader running, and a
long-lived `watch` subscription no longer grows its dedupe set without
bound. ([#3891](#3891))
- Continuation chat boots no longer stall for around 10 seconds before
the first turn. The `session.in` resume cursor is now found with a
non-blocking records read instead of draining an SSE long-poll (which
always waited out its full 5 second inactivity window, twice per boot),
the boot reads run concurrently, and chat snapshots carry the cursor so
subsequent boots skip the scan entirely.
([#3907](#3907))
- Fix `chat.headStart` when `hydrateMessages` is registered. The warm
route's step-1 partial now reaches the agent's accumulator on the
hydrate path, so `onTurnComplete` carries the full first turn (the
head-start user message included), tool-call handovers resume from step
2 instead of re-running step 1, and the assistant `messageId` stays
stable across the handover.
([#3907](#3907))
- Preserve reasoning parts across the `chat.headStart` handover.
Extended-thinking models' step-1 reasoning now lands in the durable
session history (and `onTurnComplete`) under the same assistant
`messageId`, with provider metadata intact so Anthropic thinking
signatures survive replays.
([#3907](#3907))
- Fix two `chat.createSession()` bugs: stopping a generation no longer
wedges the run (the turn loop raced a `totalUsage` promise that never
settles after a stop-abort), and continuation runs now wait for the next
message instead of invoking the model with an empty prompt.
([#3920](#3920))
-   Updated dependencies:
    -   `@trigger.dev/core@4.5.0-rc.6`

## @trigger.dev/plugins@4.5.0-rc.6

### Patch Changes

-   Updated dependencies:
    -   `@trigger.dev/core@4.5.0-rc.6`

</details>

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

2 participants