feat(data-plane): add TQ fault tolerance APIs #2492

Open
pthombre wants to merge 1 commit into zhiyul/data_plane_plan from pranav/tq_fault_tolerance

@pthombre

Summary

This PR adds the recovery/control-plane API surface that the async SingleController needs to coordinate TransferQueue (TQ) without moving tensor payloads through the controller. The new APIs let the controller inspect committed rollout metadata, count queue occupancy, remove trained or stale groups, health-check the data-plane path, and handle TransferQueue/Ray failures through typed exceptions.

What Changed

DataPlaneClient recovery API

Adds the following methods to the shared DataPlaneClient contract:

  • ping(timeout_s) to validate data-plane request-path liveness.
  • list_metadata(partition_id) to return non-consuming rollout-group metadata.
  • depth(partition_id) to count committed and complete groups visible to recovery.
  • pop(keys, partition_id) to remove successfully trained rows.
  • evict(keys, partition_id) to remove stale or abandoned rows.
  • get_capabilities() to expose backend recovery guarantees.
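As a rough illustration of the contract listed above, the sketch below models the recovery methods on an abstract base class with a tiny in-memory implementation. Only the method names and their parameters come from this description; the class names, return types, dictionary-based metadata, and the `_clear()` hook are assumptions made for the example, and `get_capabilities()` is omitted for brevity.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from typing import Sequence


class RecoveryClientSketch(ABC):
    """Hypothetical stand-in for the recovery part of DataPlaneClient."""

    @abstractmethod
    def ping(self, timeout_s: float) -> bool:
        """Return True if the request path answers within timeout_s."""

    @abstractmethod
    def list_metadata(self, partition_id: str) -> list[dict]:
        """Return group metadata without consuming any rows."""

    @abstractmethod
    def _clear(self, keys: Sequence[str], partition_id: str) -> int:
        """Assumed backend hook: delete keys, return how many were removed."""

    def depth(self, partition_id: str) -> int:
        # Count only groups that are committed and hold every expected key.
        return sum(
            1
            for g in self.list_metadata(partition_id)
            if g["committed"] and len(g["keys"]) == g["expected_num_keys"]
        )

    def pop(self, keys: Sequence[str], partition_id: str) -> int:
        # Rows that were trained successfully.
        return self._clear(keys, partition_id)

    def evict(self, keys: Sequence[str], partition_id: str) -> int:
        # Rows that are stale or abandoned.
        return self._clear(keys, partition_id)


class InMemoryClient(RecoveryClientSketch):
    def __init__(self) -> None:
        # partition_id -> group_id -> group record
        self.groups: dict[str, dict[str, dict]] = {}

    def ping(self, timeout_s: float) -> bool:
        return True

    def list_metadata(self, partition_id: str) -> list[dict]:
        # Return copies so callers cannot mutate stored records.
        return [dict(g) for g in self.groups.get(partition_id, {}).values()]

    def _clear(self, keys: Sequence[str], partition_id: str) -> int:
        doomed, removed = set(keys), 0
        part = self.groups.get(partition_id, {})
        for gid in list(part):
            kept = [k for k in part[gid]["keys"] if k not in doomed]
            removed += len(part[gid]["keys"]) - len(kept)
            if kept:
                part[gid]["keys"] = kept
            else:
                del part[gid]  # drop groups that have no rows left
        return removed
```

Keeping `pop()` and `evict()` as separate entry points over one delete path lets callers and metrics distinguish "trained" from "stale" removals even when the backend treats them identically.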

Metadata and capabilities

Adds DataPlaneGroupMeta, a control-plane-only record for rollout groups. It includes:

  • partition_id
  • group_id
  • keys
  • weight_version
  • created_at
  • committed
  • expected_num_keys
  • size_bytes
  • tags

This gives SingleController enough information to select trainable groups, reject stale groups, reconstruct queue depth after restart, and pass key references to the trainer without fetching tensors.

Adds DataPlaneCapabilities so backends can advertise recovery-relevant behavior such as persistent recovery, server-side filtering, atomic batch put, and verified clear support.
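One possible shape for these two records is sketched below. The field names of the group record follow the list above; every type annotation, the default values, the `is_complete` helper, and the capability flag names are assumptions for illustration and may not match the classes in this PR.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class GroupMetaSketch:
    """Illustrative shape for DataPlaneGroupMeta; field types are assumed."""

    partition_id: str
    group_id: str
    keys: tuple[str, ...]
    weight_version: int
    created_at: float  # assumed: seconds since the epoch
    committed: bool
    expected_num_keys: int
    size_bytes: int = 0
    tags: dict[str, str] = field(default_factory=dict)

    @property
    def is_complete(self) -> bool:
        # Hypothetical helper: every expected key has been written.
        return len(self.keys) == self.expected_num_keys


@dataclass(frozen=True)
class CapabilitiesSketch:
    """Illustrative shape for DataPlaneCapabilities; flag names are assumed."""

    persistent_recovery: bool = False
    server_side_filtering: bool = False
    atomic_batch_put: bool = False
    verified_clear: bool = False


meta = GroupMetaSketch(
    partition_id="train",
    group_id="g0",
    keys=("g0/0", "g0/1"),
    weight_version=7,
    created_at=0.0,
    committed=True,
    expected_num_keys=2,
)
```

Note that the record carries only keys and bookkeeping fields, never tensors, which is what keeps it safe to pass through the controller.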

Typed data-plane failures

Adds a typed exception hierarchy:

  • DataPlaneError
  • DataPlaneUnavailable
  • DataPlaneTimeout
  • DataPlaneReadError
  • DataPlaneWriteError
  • DataPlaneClearError
  • DataPlaneNotReady
  • DataPlaneBadRequest

The TransferQueue adapter now translates underlying Ray/TQ/storage exceptions into these data-plane exceptions, allowing SingleController to route failures to recovery logic without parsing generic exception strings.
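The translation idea can be sketched as follows. The exception names are the ones listed above, but the flat inheritance from `DataPlaneError`, the `call_translated()` helper, and the mapping rules are assumptions; built-in exceptions stand in for the Ray/TQ/storage errors the real adapter handles.

```python
class DataPlaneError(Exception):
    """Assumed root of the hierarchy."""


class DataPlaneUnavailable(DataPlaneError): ...
class DataPlaneTimeout(DataPlaneError): ...
class DataPlaneReadError(DataPlaneError): ...
class DataPlaneWriteError(DataPlaneError): ...
class DataPlaneClearError(DataPlaneError): ...
class DataPlaneNotReady(DataPlaneError): ...
class DataPlaneBadRequest(DataPlaneError): ...


# Fallback error type per operation kind (hypothetical mapping).
_OP_ERRORS = {
    "read": DataPlaneReadError,
    "write": DataPlaneWriteError,
    "clear": DataPlaneClearError,
}


def call_translated(op, fn, *args, **kwargs):
    """Run fn and re-raise any failure as a typed data-plane error."""
    try:
        return fn(*args, **kwargs)
    except DataPlaneError:
        raise  # already typed, pass through unchanged
    except TimeoutError as exc:
        raise DataPlaneTimeout(f"{op} timed out") from exc
    except ConnectionError as exc:
        raise DataPlaneUnavailable(f"{op}: backend unreachable") from exc
    except ValueError as exc:
        raise DataPlaneBadRequest(f"{op}: invalid request") from exc
    except Exception as exc:
        raise _OP_ERRORS.get(op, DataPlaneError)(f"{op} failed: {exc}") from exc
```

Chaining with `from exc` keeps the original traceback available for debugging while callers branch on the typed class alone.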

TransferQueue adapter implementation

Updates TQDataPlaneClient to:

  • wrap TQ calls through _call_tq() for typed error translation;
  • expose ping() using the TQ request path;
  • implement list_metadata() by calling kv_list(), grouping keys by group_id, and parsing producer tags;
  • expose backend capabilities;
  • route pop() and evict() through kv_clear() via the base interface.
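The grouping step in list_metadata() might look roughly like the sketch below. The return shape of kv_list() is not shown in this description, so the example assumes an iterable of `(key, tags)` pairs whose tags carry string values for `group_id`, `weight_version`, and `expected_num_keys`; the function name and tag names are hypothetical.

```python
from collections import defaultdict


def group_rows(rows):
    """Group listed rows by group_id.

    rows: iterable of (key, tags) pairs. This shape is an assumption, not
    the actual kv_list() return type.
    """
    grouped = defaultdict(list)
    for key, tags in rows:
        grouped[tags["group_id"]].append((key, tags))

    metas = []
    for group_id, members in grouped.items():
        first_tags = members[0][1]  # assume tags agree within a group
        metas.append(
            {
                "group_id": group_id,
                "keys": [key for key, _ in members],
                # Producer tags are assumed to be strings, so parse them.
                "weight_version": int(first_tags["weight_version"]),
                "expected_num_keys": int(first_tags["expected_num_keys"]),
            }
        )
    return metas
```

Comparing `len(meta["keys"])` with `meta["expected_num_keys"]` then tells the caller whether a group is complete without reading any row payload.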

Existing direct-by-key and task-mediated APIs now also surface typed data-plane errors for read/write/clear paths.

NoOp adapter and observability

Updates the in-memory NoOp adapter to implement the same recovery API so unit tests can validate the contract without Ray or TransferQueue.

Extends MetricsDataPlaneClient to record the new recovery operations: ping, list_metadata, depth, pop, and evict.

Tests

Adds and updates unit tests for:

  • recovery metadata being non-consuming;
  • committed/complete depth() behavior;
  • pop() and evict() removing keys;
  • ping() failure behavior on a closed client;
  • TQ adapter metadata grouping;
  • typed TQ list/clear/timeout errors;
  • observability coverage for recovery operations;
  • expanded ABC surface checks.

Why This Helps Async SingleController

SingleController needs to orchestrate async rollout, training, recovery, and backpressure while preserving the invariant that it never sees tensor data. These APIs provide the required control-plane boundary.

list_metadata() is the key API for async training. It gives the controller a non-consuming view of what rollout groups exist in TQ, whether they are committed, whether they are complete, and which weight version produced them. The controller can use that to run staleness selection and build a slice of keys for the trainer. Tensor movement remains direct between Trainer/GenWorker and TQ.
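A minimal sketch of staleness selection from metadata follows. It assumes metadata records are plain dictionaries with the fields named earlier, that staleness is measured as the gap between the current weight version and the producing version, and that older groups are trained first; `plan_step()` and its parameters are hypothetical and not part of this PR.

```python
def plan_step(metas, current_version, max_staleness, max_groups):
    """Pick keys to train on and keys to discard, using metadata only."""
    # Only committed groups with every expected key are candidates.
    ready = [
        m
        for m in metas
        if m["committed"] and len(m["keys"]) == m["expected_num_keys"]
    ]
    stale = [
        m for m in ready if current_version - m["weight_version"] > max_staleness
    ]
    fresh = [
        m for m in ready if current_version - m["weight_version"] <= max_staleness
    ]
    fresh.sort(key=lambda m: m["created_at"])  # oldest first (assumed policy)
    chosen = fresh[:max_groups]

    train_keys = [k for m in chosen for k in m["keys"]]
    stale_keys = [k for m in stale for k in m["keys"]]
    return train_keys, stale_keys
```

The returned `train_keys` would be handed to the trainer as references, and `stale_keys` would be candidates for evict(); no tensor is read at any point.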

depth() lets SingleController (SC) reconstruct queue occupancy after an SC restart or a TQ recovery. This is needed to rebuild _tq_capacity_used and avoid either over-dispatching generation or deadlocking the rollout pump.

pop() centralizes deletion of successfully trained rows in SingleController. The trainer fetches rows from TQ by key, trains on them, and returns a result; SC removes the keys only after training succeeds. This keeps semaphore release and queue cleanup under one owner.

evict() gives SC an explicit path to remove stale or abandoned rollout groups. This prevents stale rows from permanently occupying TQ capacity and blocking generation.
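The single-owner cleanup pattern described for pop() and evict() can be sketched as below. `QueueOwnerSketch`, its method names, the use of one semaphore slot per group, and the `train_fn` callable (a stand-in for the remote trainer call) are all assumptions for illustration; only pop(keys, partition_id) and evict(keys, partition_id) come from this PR.

```python
import threading


class QueueOwnerSketch:
    """Hypothetical owner of both capacity slots and queue cleanup."""

    def __init__(self, client, partition_id, capacity):
        self.client = client
        self.partition_id = partition_id
        # One slot per in-flight rollout group (assumed unit of capacity).
        self.slots = threading.BoundedSemaphore(capacity)

    def reserve(self) -> bool:
        # Called before dispatching generation; False means backpressure.
        return self.slots.acquire(blocking=False)

    def finish_training(self, group_keys, train_fn):
        # If train_fn raises, nothing below runs: rows stay in the queue
        # and the slot stays held.
        train_fn(group_keys)
        self.client.pop(group_keys, self.partition_id)
        self.slots.release()

    def drop_stale(self, group_keys):
        # Stale or abandoned groups free their slot through evict().
        self.client.evict(group_keys, self.partition_id)
        self.slots.release()
```

Because a slot is released only next to the matching pop() or evict(), a failed training step cannot free capacity for rows that still occupy the queue.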

ping() plus typed failures give SC clean recovery triggers for _recover_tq(). Instead of depending on raw Ray/TQ exceptions, the controller can catch DataPlaneUnavailable or DataPlaneTimeout, pause rollout dispatch, release local capacity assumptions, wait for TQ to return, and reconstruct state from list_metadata().
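The recovery trigger described above could be sketched like this. It is not the `_recover_tq()` implementation: the `recover()` function, its retry parameters, the assumption that ping() raises a typed error on failure, and the dictionary-shaped metadata are all hypothetical, and the exception classes are redeclared locally so the example stands alone.

```python
import time


class DataPlaneError(Exception): ...
class DataPlaneUnavailable(DataPlaneError): ...
class DataPlaneTimeout(DataPlaneError): ...


def recover(client, partition_id, *, attempts=5, delay_s=0.0, timeout_s=1.0):
    """Wait for the data plane to answer, then rebuild occupancy.

    Returns the number of committed, complete groups still in the queue,
    which the caller can use to reset its local capacity counter.
    """
    for _ in range(attempts):
        try:
            client.ping(timeout_s)
            break  # request path is alive again
        except (DataPlaneUnavailable, DataPlaneTimeout):
            time.sleep(delay_s)  # rollout dispatch stays paused meanwhile
    else:
        raise DataPlaneUnavailable("data plane did not come back")

    # Discard local assumptions and trust only what the queue reports.
    metas = client.list_metadata(partition_id)
    return sum(
        1
        for m in metas
        if m["committed"] and len(m["keys"]) == m["expected_num_keys"]
    )
```

Catching only the two typed errors means unrelated failures still surface immediately instead of being retried.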

Overall, this PR does not implement SingleController itself. It adds the TransferQueue/DataPlane API surface needed by SingleController to manage async queue state, staleness, cleanup, and recovery safely from metadata only.

Testing

  • git diff --cached --check passed before commit.
  • Python compile checks passed for the changed data-plane files and the new TQ recovery API test.
  • Full pytest was not run in this workspace because uv could not locate the repo-required Python 3.13.13 interpreter and the system Python lacks pytest.

@pthombre requested review from a team as code owners on May 14, 2026 02:42
@copy-pr-bot

copy-pr-bot Bot commented May 14, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>
@pthombre force-pushed the pranav/tq_fault_tolerance branch from 1d02615 to 430dee5 on May 14, 2026 02:46