fix(orch): upload/V4 header race, including P2P#2532
Open
levb wants to merge 17 commits into lev-compression-final from
Fixes a rapid Pause/Resume race where a child layer's V4 header could finalize against a stale, in-flight parent's Builds map. Replaces the position-based UploadTracker and cross-layer PendingBuildInfo with one buildID-keyed UploadCoordinator: child layers wait on the parent's SwapHeader through a per-build SetOnce, used by both the builder and runtime Pause/Checkpoint paths. build.File holds an atomic.Pointer to the header so the upload publishes the finalized V4 header to in-process readers immediately.
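A minimal sketch of the header-publication idea described above, assuming a simplified `Header` and `File` (both are hypothetical stand-ins, not the real `build.File`; the `SwapHeader` signature is also an assumption). It only illustrates how an `atomic.Pointer` lets readers see the finalized header as soon as the upload swaps it in:

```go
package main

import "sync/atomic"

// Header is a hypothetical, simplified stand-in for the real header type.
type Header struct {
	Version int
	Builds  map[string]struct{}
}

// File is a hypothetical stand-in for build.File. It holds the header
// behind an atomic pointer so readers never observe a half-updated value.
type File struct {
	header atomic.Pointer[Header]
}

// NewFile creates a File that initially publishes h.
func NewFile(h *Header) *File {
	f := &File{}
	f.header.Store(h)
	return f
}

// Header returns the currently published header.
func (f *File) Header() *Header {
	return f.header.Load()
}

// SwapHeader publishes the finalized header and returns the previous one.
// The signature is an assumption made for this sketch.
func (f *File) SwapHeader(h *Header) *Header {
	return f.header.Swap(h)
}
```

In this sketch a reader that calls `Header()` after `SwapHeader` returns always gets the new pointer; a reader that loaded the old pointer earlier keeps a consistent (if outdated) snapshot.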
- Lift UploadCoordinator construction into factories/run.go so the
orchestrator server (Pause path) and template-manager (build path)
share a single node-wide instance, closing the leaked TTL goroutine
in runBuild.
- Replace SetOnce[struct{}] with the existing utils.ErrorOnce.
- Make uploadCoord nil-tolerant in layer_executor.PauseAndUpload so
one-shot CLIs (create-build, smoketest, benchmarks) can pass nil
without bringing along the coordinator's lifecycle.
…infra into lev-compression-wait-simplified
Replace UploadCoordinator + Snapshot.Upload + compressed/uncompressed uploader pair with sandbox.Uploads (in-flight registry) and sandbox.Upload (per-build session). Fix V3 omitting SwapHeader after upload. Add spans on Wait and async upload.
Peers now signal "GCS is the source of truth" via a single bool (use_storage); they no longer ship serialized headers. Consumers fetch the V4 header from GCS via build.LoadV4 — bounded poll with an optional hint channel for future Redis-pubsub acceleration. Closes the cross-orch chained-build hole: Uploads.Wait refreshes stale parent headers (detected as V4 without self-entry) before constructing child lineage.
Uploads.Wait subscribes to a per-build channel while polling GCS for the parent's V4 header; uploader publishes on Finish. Empty payload = success (poll now); non-empty = upload error (fail fast). Falls back to ticker polling when redis is not configured. PollRemoteStorageForHeader timeout budgets: 30s read-path, 20min upload-path.
CI-only failure: putFinalHeader and TestUploads_Wait_NoFuture_ReadsFromCache built V4-typed headers without a self-entry in Builds, which the new isStale check now flags as stale. Tests pass locally because the non-fix lint pass also runs gofmt-equivalent fixers; the test fixture bug only surfaces under `go test` directly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
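For context, the staleness rule mentioned here ("V4 without self-entry") might look like the sketch below. The `Header` shape, the string build IDs, and the nil handling are assumptions made for illustration; only the rule itself comes from the commit messages.

```go
package main

// Header is a hypothetical, simplified header. The real type uses UUIDs
// for build IDs; plain strings keep this sketch dependency-free.
type Header struct {
	Version uint64
	BuildID string
	Builds  map[string]struct{}
}

// isStale reports whether a V4 header lacks an entry for its own build
// in Builds. Headers older than V4 are never considered stale here.
// Treating a nil header as "not stale" is an assumption of this sketch.
func isStale(h *Header) bool {
	if h == nil || h.Version < 4 {
		return false
	}
	_, hasSelf := h.Builds[h.BuildID]
	return !hasSelf
}
```

Under this rule a test fixture that builds a V4 header must also add its own build ID to `Builds`, which is the fixture bug this commit describes.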
…infra into lev-compression-wait-simplified
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b4dcd8ca9b
dobrac (Contributor) requested changes on May 1, 2026
When we get to fetch in a chunker, we should error out if the header is incomplete - defensive check to catch bugs early
- Serialize IncompletePendingUpload bit into V4 envelope; peer-server forces it on, StoreHeader refuses to persist it (V3 unchanged).
- Drop ParentBuildID from Snapshot/Upload; collectAncestorBuilds waits on the full mapping closure (skips self + uuid.Nil).
- Various review-driven simplifications
dobrac approved these changes on May 1, 2026
Fixes a race in the V4 upload + header path, including across orchestrators (P2P).
SetOnce futures. Replaces upload_tracker in layer_executor and the ad-hoc plumbing in build_upload*. A single Uploads registry tracks in-flight uploads per build; waiters block on a future that resolves once the upload (and its V4 header) is durable.
build_upload.go / build_upload_v3.go / build_upload_v4.go are replaced by upload.go + upload_v3.go + upload_v4.go + uploads.go under pkg/sandbox, with a clearer split between the per-version upload step and the registry that coordinates waiters.
where the peer-side header could disagree with what later landed in storage.