feat(data-plane): add TQ fault tolerance APIs #2492
Open
pthombre wants to merge 1 commit into
Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>
Summary
This PR adds the recovery/control-plane API surface that the async SingleController needs in order to coordinate TransferQueue without moving tensor payloads through the controller. The new APIs let the controller inspect committed rollout metadata, count queue occupancy, remove trained or stale groups, health-check the data-plane path, and handle TransferQueue/Ray failures through typed exceptions.
What Changed
DataPlaneClient recovery API
Adds the following methods to the shared `DataPlaneClient` contract:
- `ping(timeout_s)` to validate data-plane request-path liveness.
- `list_metadata(partition_id)` to return non-consuming rollout-group metadata.
- `depth(partition_id)` to count committed and complete groups visible to recovery.
- `pop(keys, partition_id)` to remove successfully trained rows.
- `evict(keys, partition_id)` to remove stale or abandoned rows.
- `get_capabilities()` to expose backend recovery guarantees.

Metadata and capabilities
Adds `DataPlaneGroupMeta`, a control-plane-only record for rollout groups. It includes:
- `partition_id`
- `group_id`
- `keys`
- `weight_version`
- `created_at`
- `committed`
- `expected_num_keys`
- `size_bytes`
- `tags`

This gives SingleController enough information to select trainable groups, reject stale groups, reconstruct queue depth after restart, and pass key references to the trainer without fetching tensors.
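As a rough illustration of how such a record could drive group selection, here is a minimal sketch. The field names come from the list above; the field types, the `is_complete` property, the `select_trainable` helper, and the `max_staleness` parameter are assumptions made for this example and are not taken from the PR.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DataPlaneGroupMeta:
    """Control-plane-only record for one rollout group (no tensor payloads)."""

    partition_id: str
    group_id: str
    keys: tuple[str, ...]
    weight_version: int
    created_at: float  # assumed: seconds since the epoch
    committed: bool
    expected_num_keys: Optional[int] = None
    size_bytes: Optional[int] = None
    tags: dict[str, str] = field(default_factory=dict)

    @property
    def is_complete(self) -> bool:
        # Hypothetical helper: complete once every expected key is present.
        return (
            self.expected_num_keys is not None
            and len(self.keys) == self.expected_num_keys
        )


def select_trainable(groups, current_version: int, max_staleness: int):
    """Split groups into (trainable, stale); unfinished groups land in neither."""
    trainable, stale = [], []
    for g in groups:
        if not (g.committed and g.is_complete):
            continue  # still being produced, leave it alone
        if current_version - g.weight_version > max_staleness:
            stale.append(g)
        else:
            trainable.append(g)
    return trainable, stale


groups = [
    DataPlaneGroupMeta("train", "g0", ("g0/0", "g0/1"), 7, 0.0, True, 2),
    DataPlaneGroupMeta("train", "g1", ("g1/0",), 7, 0.0, True, 2),  # incomplete
    DataPlaneGroupMeta("train", "g2", ("g2/0", "g2/1"), 3, 0.0, True, 2),  # stale
]
trainable, stale = select_trainable(groups, current_version=8, max_staleness=2)
```

In this sketch the stale list would be a natural input to `evict()`, while the keys of the trainable groups would be handed to the trainer by reference.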
Adds `DataPlaneCapabilities` so backends can advertise recovery-relevant behavior such as persistent recovery, server-side filtering, atomic batch put, and verified clear support.

Typed data-plane failures
Adds a typed exception hierarchy:
- `DataPlaneError`
- `DataPlaneUnavailable`
- `DataPlaneTimeout`
- `DataPlaneReadError`
- `DataPlaneWriteError`
- `DataPlaneClearError`
- `DataPlaneNotReady`
- `DataPlaneBadRequest`

The TransferQueue adapter now translates underlying Ray/TQ/storage exceptions into these data-plane exceptions, allowing SingleController to route failures to recovery logic without parsing generic exception strings.
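The translation idea can be sketched as follows. The exception names are the ones listed above, but the flat layout (every class deriving directly from `DataPlaneError`), the `call_translated` helper, and the mapping from Python built-in exceptions are assumptions for illustration; the actual adapter maps Ray/TQ/storage exceptions, which are not shown in this description.

```python
class DataPlaneError(Exception):
    """Base class for all data-plane failures."""


# Assumed layout: every typed failure derives directly from DataPlaneError.
class DataPlaneUnavailable(DataPlaneError): ...
class DataPlaneTimeout(DataPlaneError): ...
class DataPlaneReadError(DataPlaneError): ...
class DataPlaneWriteError(DataPlaneError): ...
class DataPlaneClearError(DataPlaneError): ...
class DataPlaneNotReady(DataPlaneError): ...
class DataPlaneBadRequest(DataPlaneError): ...


def call_translated(fn, *args, fallback=DataPlaneError, **kwargs):
    """Run fn and re-raise backend failures as typed data-plane errors.

    Hypothetical helper; built-in exceptions stand in for backend ones.
    """
    try:
        return fn(*args, **kwargs)
    except DataPlaneError:
        raise  # already typed, pass through unchanged
    except TimeoutError as exc:
        raise DataPlaneTimeout(str(exc)) from exc
    except ConnectionError as exc:
        raise DataPlaneUnavailable(str(exc)) from exc
    except (ValueError, KeyError) as exc:
        raise DataPlaneBadRequest(str(exc)) from exc
    except Exception as exc:
        # Anything else becomes the operation-specific fallback type.
        raise fallback(str(exc)) from exc


def slow_read():
    raise TimeoutError("storage unit did not answer")


try:
    call_translated(slow_read, fallback=DataPlaneReadError)
except DataPlaneTimeout as exc:
    caught = exc
```

Chaining with `from exc` keeps the original backend exception available as `__cause__` for logging, while callers only need to match on the typed class.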
TransferQueue adapter implementation
Updates `TQDataPlaneClient` to provide:
- `_call_tq()` for typed error translation;
- `ping()` using the TQ request path;
- `list_metadata()` by calling `kv_list()`, grouping keys by `group_id`, and parsing producer tags;
- `pop()` and `evict()` through `kv_clear()` via the base interface.

Existing direct-by-key and task-mediated APIs now also surface typed data-plane errors for read/write/clear paths.
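The grouping step in `list_metadata()` might look roughly like the sketch below. It does not call TransferQueue: the input is a plain list of `(key, tags)` pairs, which is a hypothetical stand-in for whatever `kv_list()` returns, and the tag names are assumed to mirror the `DataPlaneGroupMeta` fields.

```python
from collections import defaultdict


def group_listing(entries, partition_id):
    """Group (key, tags) pairs by tags['group_id'] into plain metadata dicts."""
    buckets = defaultdict(list)
    for key, tags in entries:
        buckets[tags["group_id"]].append((key, tags))

    groups = []
    for group_id, items in buckets.items():
        first_tags = items[0][1]  # assumed: group-level tags repeat on every key
        groups.append({
            "partition_id": partition_id,
            "group_id": group_id,
            "keys": sorted(key for key, _ in items),
            "weight_version": first_tags.get("weight_version"),
            # A group counts as committed only if every key says so.
            "committed": all(t.get("committed", False) for _, t in items),
            "expected_num_keys": first_tags.get("expected_num_keys"),
        })
    return groups


entries = [
    ("g0/0", {"group_id": "g0", "weight_version": 7, "committed": True,
              "expected_num_keys": 2}),
    ("g1/0", {"group_id": "g1", "weight_version": 7, "committed": False,
              "expected_num_keys": 2}),
    ("g0/1", {"group_id": "g0", "weight_version": 7, "committed": True,
              "expected_num_keys": 2}),
]
listing = group_listing(entries, "train")
```

Because the listing only touches keys and tags, it stays non-consuming and never pulls tensor payloads into the controller.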
NoOp adapter and observability
Updates the in-memory NoOp adapter to implement the same recovery API so unit tests can validate the contract without Ray or TransferQueue.
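A toy version of such an in-memory adapter is sketched below to show how the recovery calls can be exercised without Ray or TransferQueue. The class name, the `put` and `close` helpers, the integer return value of `pop()`/`evict()`, and the choice of `DataPlaneUnavailable` for a closed client are all assumptions; the PR's NoOp adapter may differ.

```python
class DataPlaneUnavailable(Exception):
    """Local stand-in for the typed failure named in this PR."""


class InMemoryRecoveryClient:
    """Toy adapter that stores key -> group_id per partition."""

    def __init__(self):
        self._rows = {}  # partition_id -> {key: group_id}
        self._closed = False

    def put(self, key, group_id, partition_id):
        self._rows.setdefault(partition_id, {})[key] = group_id

    def close(self):
        self._closed = True

    def ping(self, timeout_s=1.0):
        # timeout_s is unused here; nothing in memory can block.
        if self._closed:
            raise DataPlaneUnavailable("client is closed")
        return True

    def depth(self, partition_id):
        # Counts distinct groups; a real backend would also check
        # that each group is committed and complete.
        return len(set(self._rows.get(partition_id, {}).values()))

    def _remove(self, keys, partition_id):
        rows = self._rows.get(partition_id, {})
        return sum(1 for k in keys if rows.pop(k, None) is not None)

    def pop(self, keys, partition_id):
        """Remove rows that were trained successfully."""
        return self._remove(keys, partition_id)

    def evict(self, keys, partition_id):
        """Remove stale or abandoned rows."""
        return self._remove(keys, partition_id)


client = InMemoryRecoveryClient()
client.put("g0/0", "g0", "train")
client.put("g0/1", "g0", "train")
client.put("g1/0", "g1", "train")
depth_before = client.depth("train")
removed = client.pop(["g0/0", "g0/1"], "train")
depth_after = client.depth("train")
```

Here `pop()` and `evict()` share one removal path and differ only in intent, which mirrors the description of both being routed through the same clear operation.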
Extends `MetricsDataPlaneClient` to record the new recovery operations: `ping`, `list_metadata`, `depth`, `pop`, and `evict`.

Tests
Adds and updates unit tests for:
- `depth()` behavior;
- `pop()` and `evict()` removing keys;
- `ping()` failure behavior on a closed client.

Why This Helps Async SingleController
SingleController needs to orchestrate async rollout, training, recovery, and backpressure while preserving the invariant that it never sees tensor data. These APIs provide the required control-plane boundary.
- `list_metadata()` is the key API for async training. It gives the controller a non-consuming view of what rollout groups exist in TQ, whether they are committed, whether they are complete, and which weight version produced them. The controller can use that to run staleness selection and build a slice of keys for the trainer. Tensor movement remains direct between Trainer/GenWorker and TQ.
- `depth()` lets SingleController reconstruct queue occupancy after SC restart or TQ recovery. This is needed to rebuild `_tq_capacity_used` and avoid either over-dispatching generation or deadlocking the rollout pump.
- `pop()` centralizes deletion of successfully trained rows in SingleController. Trainer trains on keys fetched from TQ, returns a result, and SC removes the keys only after training succeeds. This keeps semaphore release and queue cleanup in one owner.
- `evict()` gives SC an explicit path to remove stale or abandoned rollout groups. This prevents stale rows from permanently occupying TQ capacity and blocking generation.
- `ping()` plus typed failures give SC clean recovery triggers for `_recover_tq()`. Instead of depending on raw Ray/TQ exceptions, the controller can catch `DataPlaneUnavailable` or `DataPlaneTimeout`, pause rollout dispatch, release local capacity assumptions, wait for TQ to return, and reconstruct state from `list_metadata()`.

Overall, this PR does not implement SingleController itself. It adds the TransferQueue/DataPlane API surface needed by SingleController to manage async queue state, staleness, cleanup, and recovery safely from metadata only.
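Since SingleController is not part of this PR, the following is only a sketch of how a controller might combine these calls. The functions `recover_tq` and `train_and_release`, the retry and backoff parameters, and the `FlakyClient` test double are hypothetical; only the method names `ping`, `list_metadata`, `depth`, and `pop` and the two exception names come from the description above.

```python
import time
from types import SimpleNamespace


class DataPlaneError(Exception): ...
class DataPlaneUnavailable(DataPlaneError): ...
class DataPlaneTimeout(DataPlaneError): ...


def recover_tq(client, partition_id, retries=5, backoff_s=0.5, sleep=time.sleep):
    """Wait until ping() succeeds, then rebuild queue state from metadata."""
    for attempt in range(retries):
        try:
            client.ping(timeout_s=backoff_s)
            break
        except (DataPlaneUnavailable, DataPlaneTimeout):
            sleep(backoff_s * (attempt + 1))  # linear backoff between pings
    else:
        raise DataPlaneUnavailable(f"no response after {retries} pings")

    groups = client.list_metadata(partition_id)  # non-consuming view
    capacity_used = client.depth(partition_id)   # committed, complete groups
    return groups, capacity_used


def train_and_release(client, trainer_step, keys, partition_id):
    """Delete rows only after the training step returned without raising."""
    result = trainer_step(keys)     # trainer reads tensors by key, not via SC
    client.pop(keys, partition_id)  # skipped if trainer_step raised
    return result


class FlakyClient:
    """Test double whose first `failures` pings raise DataPlaneUnavailable."""

    def __init__(self, failures):
        self.failures = failures
        self.popped = []

    def ping(self, timeout_s):
        if self.failures > 0:
            self.failures -= 1
            raise DataPlaneUnavailable("TQ restarting")
        return True

    def list_metadata(self, partition_id):
        return [SimpleNamespace(group_id="g0", committed=True)]

    def depth(self, partition_id):
        return 1

    def pop(self, keys, partition_id):
        self.popped.extend(keys)


flaky = FlakyClient(failures=2)
groups, capacity_used = recover_tq(flaky, "train", sleep=lambda s: None)
```

Keeping `pop()` after the training call, in the same function, is one way to keep cleanup under a single owner: a failed step leaves the rows in place for a retry or a later `evict()`.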
Testing
`git diff --cached --check` passed before commit. `uv` could not locate the repo-required Python `3.13.13` interpreter and the system Python lacks `pytest`.