Skip to content

fix(orch): upload/V4 header race, including P2P#2532

Open
levb wants to merge 17 commits intolev-compression-finalfrom
lev-compression-wait-simplified
Open

fix(orch): upload/V4 header race, including P2P#2532
levb wants to merge 17 commits intolev-compression-finalfrom
lev-compression-wait-simplified

Conversation

@levb
Copy link
Copy Markdown
Contributor

@levb levb commented Apr 30, 2026

Fixes a race in the V4 upload + header path, including across orchestrators (P2P).

  • Per-build upload coordination via SetOnce futures. Replaces upload_tracker in layer_executor and the ad-hoc plumbing in build_upload*. A single Uploads registry tracks in-flight uploads per build;
    waiters block on a future that resolves once the upload (and its V4 header) is durable.
  • Collapsed upload plumbing. build_upload.go / build_upload_v3.go / build_upload_v4.go are replaced by upload.go + upload_v3.go + upload_v4.go + uploads.go under pkg/sandbox, with a clearer
    split between the per-version upload step and the registry that coordinates waiters.
  • Drop V4 header carriage from P2P. Peers no longer ship header bytes over gRPC; once a peer signals the build is finalized, consumers fetch the canonical header from GCS. Removes a class of staleness bugs
    where the peer-side header could disagree with what later landed in storage.
  • Redis pub/sub hint for cross-orch upload-finished. Publishes a small "build finalized" hint so a remote orchestrator waiting on a sibling build wakes immediately instead of polling.
  • Adds a rapid pause/resume integration test that exercises the cross-orch wait path.

Fixes a rapid Pause/Resume race where a child layer's V4 header could
finalize against a stale, in-flight parent's Builds map. Replaces the
position-based UploadTracker and cross-layer PendingBuildInfo with one
buildID-keyed UploadCoordinator: child layers wait on the parent's
SwapHeader through a per-build SetOnce, used by both the builder and
runtime Pause/Checkpoint paths. build.File holds an atomic.Pointer to
the header so the upload publishes the finalized V4 header to in-process
readers immediately.
Comment thread packages/orchestrator/pkg/server/sandboxes.go
Comment thread packages/orchestrator/pkg/sandbox/build/build.go Outdated
Comment thread packages/orchestrator/pkg/sandbox/upload_coordinator.go Outdated
levb added 2 commits April 30, 2026 16:42
- Lift UploadCoordinator construction into factories/run.go so the
  orchestrator server (Pause path) and template-manager (build path)
  share a single node-wide instance, closing the leaked TTL goroutine
  in runBuild.
- Replace SetOnce[struct{}] with the existing utils.ErrorOnce.
- Make uploadCoord nil-tolerant in layer_executor.PauseAndUpload so
  one-shot CLIs (create-build, smoketest, benchmarks) can pass nil
  without bringing along the coordinator's lifecycle.
@levb levb assigned dobrac and unassigned tvi May 1, 2026
levb added 2 commits April 30, 2026 22:58
Replace UploadCoordinator + Snapshot.Upload + compressed/uncompressed
uploader pair with sandbox.Uploads (in-flight registry) and
sandbox.Upload (per-build session). Fix V3 omitting SwapHeader after
upload. Add spans on Wait and async upload.
Peers now signal "GCS is the source of truth" via a single bool
(use_storage); they no longer ship serialized headers. Consumers fetch
the V4 header from GCS via build.LoadV4 — bounded poll with an optional
hint channel for future Redis-pubsub acceleration. Closes the cross-orch
chained-build hole: Uploads.Wait refreshes stale parent headers
(detected as V4 without self-entry) before constructing child lineage.
@levb levb changed the title refactor(orch): per-build upload coordination via SetOnce futures fix(orch): upload/V4 header race, including P2P May 1, 2026
Uploads.Wait subscribes to a per-build channel while polling GCS for
the parent's V4 header; uploader publishes on Finish. Empty payload =
success (poll now); non-empty = upload error (fail fast). Falls back
to ticker polling when redis is not configured.

PollRemoteStorageForHeader timeout budgets:
30s read-path, 20min upload-path.
github-actions Bot and others added 3 commits May 1, 2026 15:59
CI-only failure: putFinalHeader and TestUploads_Wait_NoFuture_ReadsFromCache
built V4-typed headers without a self-entry in Builds, which the new
isStale check now flags as stale. Tests pass locally because the
non-fix lint pass also runs gofmt-equivalent fixers; the test fixture
bug only surfaces under `go test` directly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@levb levb marked this pull request as ready for review May 1, 2026 16:25
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b4dcd8ca9b

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread packages/orchestrator/cmd/create-build/main.go Outdated
Comment thread packages/orchestrator/pkg/sandbox/template/peerclient/seekable.go
Comment thread packages/orchestrator/pkg/server/main.go
Copy link
Copy Markdown
Contributor

@dobrac dobrac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we get to fetch in a chunker, we should error out if the header is incomplete - defensive check to catch bugs early

Comment thread packages/shared/pkg/storage/header/metadata.go Outdated
Comment thread packages/orchestrator/pkg/sandbox/build/header_load.go
Comment thread packages/orchestrator/pkg/sandbox/uploads.go Outdated
Comment thread packages/orchestrator/pkg/sandbox/uploads.go Outdated
Comment thread packages/orchestrator/pkg/sandbox/uploads.go Outdated
Comment thread packages/shared/pkg/storage/paths.go
Comment thread packages/orchestrator/pkg/sandbox/snapshot.go Outdated
Comment thread packages/orchestrator/pkg/sandbox/sandbox.go Outdated
Comment thread packages/orchestrator/pkg/sandbox/upload.go Outdated
Comment thread packages/orchestrator/pkg/sandbox/block/local.go Outdated
@levb levb requested a review from dobrac May 1, 2026 21:03
levb added 2 commits May 1, 2026 16:12
- Serialize IncompletePendingUpload bit into V4 envelope; peer-server
  forces it on, StoreHeader refuses to persist it (V3 unchanged).
- Drop ParentBuildID from Snapshot/Upload; collectAncestorBuilds waits
  on the full mapping closure (skips self + uuid.Nil).
- Various review-driven simplifications
@dobrac dobrac added the bug Something isn't working label May 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants