Conversation
Replace metadata on a running sandbox. Full-replace semantics —
keys absent from the body are removed. `null` and `{}` both
clear the map. Paused sandboxes and merge semantics deferred.
Wire-up: gRPC SandboxUpdateRequest gains an optional
SandboxMetadataUpdate wrapper so proto3 can distinguish
"don't touch" from "clear" (bare map<string,string> cannot).
d27a53e to 20eaa6e
Still thinking about PUT vs PATCH — it is possible that people will need to fetch the previous values just to overwrite part of them now, right?

As we already don't allow nested values, maybe a targeted PUT or PATCH might work here? So it would be, for example, something like:
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 20eaa6ea43
I worry here, because if users already have things like user_id on the sandboxes, there will be cases where these ids get wiped out, and if they stored more info there, they may not be able to recover it from other sources easily, apart from calling get metadata on the sandbox.

ok, will convert to PATCH, NP 👍
- PUT → PATCH /sandboxes/{id}/metadata with merge semantics
  (k=v upserts, k=null or "" removes, absent keys untouched);
  see the merge sketch after this list. Addresses ValentaTomas's
  concern that full-replace would force clients to round-trip
  existing tags to avoid wiping them.
- Keep APIStoredConfig write-once (it was already deprecated and
never mutated elsewhere). Add a metadata field to RunningSandbox
so List ships the live sbx.GetAPIMetadata() instead of the
create-time snapshot; the API-side nodemanager now reads from
there. Checkpoint overrides the resumed sandbox's metadata with
the live value after ResumeSandbox seeds from the stale proto.
Fixes the claude[bot]/codex P1 about PATCH values reverting on
resume and being invisible to List reconciles.
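A minimal sketch of the merge semantics described in the first bullet (my illustration — the helper name and the use of "" to stand in for JSON null are assumptions, not the PR's actual code):

```go
package main

import "fmt"

// mergeMetadata applies the PATCH merge: a non-empty value upserts, an empty
// string (standing in for JSON null on the wire) removes the key, and keys
// absent from the patch are left untouched.
func mergeMetadata(current, patch map[string]string) map[string]string {
	merged := make(map[string]string, len(current))
	for k, v := range current {
		merged[k] = v
	}
	for k, v := range patch {
		if v == "" {
			delete(merged, k) // "" / null removes the key
			continue
		}
		merged[k] = v // upsert
	}
	return merged
}

func main() {
	current := map[string]string{"env": "prod", "user_id": "42"}
	patch := map[string]string{"env": "staging", "user_id": ""}
	fmt.Println(mergeMetadata(current, patch)) // map[env:staging]
}
```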
```go
sbx, err := o.sandboxStore.Update(ctx, teamID, sandboxID, updateFunc)
if err != nil {
	switch {
	case errors.As(err, &sbxNotRunningErr):
		return &api.APIError{Code: http.StatusConflict, ClientMsg: utils.SandboxChangingStateMsg(sandboxID, sbxNotRunningErr.State), Err: err}
	case errors.Is(err, sandbox.ErrNotFound):
		return &api.APIError{Code: http.StatusNotFound, ClientMsg: utils.SandboxNotFoundMsg(sandboxID), Err: err}
	default:
		return &api.APIError{Code: http.StatusInternalServerError, ClientMsg: "Error patching sandbox metadata", Err: err}
	}
}

// …

return o.patchSandboxMetadataOnNode(ctx, sbx, merged)
```
🔴 The two-phase write in PatchSandboxMetadata (patch_metadata.go lines 47–59) is non-atomic in two distinct failure modes: (1) if patchSandboxMetadataOnNode fails after sandboxStore.Update succeeds, Redis permanently holds the new merged metadata while the node retains the old values — no rollback exists; (2) concurrent PATCH calls serialize at the Redis level but their gRPC calls are completely unserialized, so whichever gRPC arrives at the node last silently wins regardless of Redis write order, leaving Redis and the live sandbox permanently diverged. Fix (1) by capturing old metadata before sandboxStore.Update and issuing a compensating update in the error path; fix (2) by adding per-sandbox mutual exclusion around the (store.Update, patchOnNode) pair, or switching the node RPC to a conditional/versioned write.
Extended reasoning...
What the bug is and how it manifests
PatchSandboxMetadata in packages/api/internal/orchestrator/patch_metadata.go (lines 26–60) performs a two-phase write: it first commits the merged metadata to Redis via sandboxStore.Update (line 47), then propagates the result to the running sandbox on the orchestrator node via gRPC (patchSandboxMetadataOnNode, line 59). This two-step sequence is not atomic at either the failure-recovery or the concurrency level, producing two independent bugs.
Bug 1 – No rollback on gRPC failure (split-brain)
sandboxStore.Update acquires a Redis distributed lock, performs a read-modify-write, then releases the lock before returning. At that point Redis durably holds the new merged metadata. If the subsequent patchSandboxMetadataOnNode gRPC call fails for any reason (node unreachable, timeout, codes.NotFound, network partition), the function returns an error to the caller but there is no compensating sandboxStore.Update on the error path. Redis now permanently reflects the new metadata while the running sandbox has the old metadata. Because the API store is Redis-backed, this divergence survives API restarts. Any subsequent GET /sandboxes/{id} call will return the new (incorrect) metadata from Redis while the sandbox environment runs with the old values. The only recovery path is a subsequent successful end-to-end PATCH, which callers who received an error have no reason to believe is necessary.
Bug 2 – Concurrent PATCH calls silently overwrite each other at the node
The Redis-level sandboxStore.Update callback serializes concurrent writes correctly: each call acquires the distributed lock, reads the current state, computes a merged map, writes it, and releases the lock. However, the critical section ends before the gRPC call. Each call captures its own merged map via closure at the time of its Redis write, then fires the gRPC independently. Because gRPC calls are completely unserialized with respect to each other, whichever call's RPC arrives at the node last wins — regardless of which Redis write was later. The gRPC carries the full resolved map (not a delta), so the earlier call's stale snapshot fully replaces whatever the node received from the later call.
Why existing code does not prevent this
The orchestrator-side Update handler (sandboxes.go:336–347) correctly uses ApplyAllOrNone with compensating closures to protect the node's in-memory apiMetadata, but this cannot reach back to undo an already-committed Redis write. The Reconcile loop only kills orphaned sandboxes; it never pushes Redis metadata back to a running node. There is no per-sandbox mutex or sequencing primitive around the (store.Update, patchOnNode) pair in the API layer.
Step-by-step proof of Bug 1
1. Client PATCHes `{env: staging}` on sandbox S. `sandboxStore.Update` succeeds → Redis: `{env: staging}`.
2. `patchSandboxMetadataOnNode` gRPC fails (timeout). API returns 500.
3. Redis permanently holds `{env: staging}`; node still has `{env: prod}`.
4. Any `GET /sandboxes/S` now returns `{env: staging}` from Redis while the sandbox process runs with `{env: prod}`.
5. Divergence persists across API restarts until a successful end-to-end PATCH overwrites both.
Step-by-step proof of Bug 2
Initial metadata: `{}`. Call A patches `{x:1}`, call B patches `{y:2}` concurrently.
1. A's Redis write: Redis=`{x:1}`, A holds merged=`{x:1}`.
2. B's Redis write (reads A's result): Redis=`{x:1,y:2}`, B holds merged=`{x:1,y:2}`.
3. B's gRPC fires first → node apiMetadata=`{x:1,y:2}`.
4. A's gRPC fires second → node apiMetadata=`{x:1}`.

Final: Redis=`{x:1,y:2}` (correct) vs. node=`{x:1}` (stale). The divergence is silent, returns no error, and persists for the sandbox lifetime.
How to fix
Bug 1: Capture old metadata before calling sandboxStore.Update and issue a compensating sandboxStore.Update restoring the old value in the patchSandboxMetadataOnNode error path. Alternatively, reverse the order: call patchSandboxMetadataOnNode first (already rollback-safe on the node side) and only commit to Redis on success. Bug 2: Wrap the (sandboxStore.Update, patchSandboxMetadataOnNode) pair in a per-sandbox mutex or keyed lock so concurrent PATCH calls for the same sandbox are serialized end-to-end.
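A compact sketch of both fixes together — per-sandbox serialization plus a compensating store write — using stand-in types (the keyed lock, store, and function names are illustrative, not the repo's actual APIs):

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// Illustrative stand-ins only — not the repo's real sandboxStore or gRPC client.
type fakeStore struct {
	mu   sync.Mutex
	meta map[string]string
}

func (s *fakeStore) update(fn func(map[string]string) map[string]string) map[string]string {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.meta = fn(s.meta)
	return s.meta
}

// sandboxID -> *sync.Mutex: per-sandbox end-to-end serialization (fix for Bug 2).
var sandboxLocks sync.Map

func patchMetadata(ctx context.Context, st *fakeStore, sandboxID string,
	merge func(map[string]string) map[string]string,
	pushToNode func(context.Context, map[string]string) error,
) error {
	l, _ := sandboxLocks.LoadOrStore(sandboxID, &sync.Mutex{})
	mu := l.(*sync.Mutex)
	mu.Lock()
	defer mu.Unlock()

	var old map[string]string
	merged := st.update(func(cur map[string]string) map[string]string {
		old = cur // remember the pre-patch value for a compensating write (Bug 1)
		return merge(cur)
	})

	if err := pushToNode(ctx, merged); err != nil {
		// Roll the store back so it does not diverge from the node for good.
		st.update(func(map[string]string) map[string]string { return old })
		return fmt.Errorf("patching metadata on node: %w", err)
	}
	return nil
}

func main() {
	st := &fakeStore{meta: map[string]string{"env": "prod"}}
	err := patchMetadata(context.Background(), st, "sbx-1",
		func(cur map[string]string) map[string]string { return map[string]string{"env": "staging"} },
		func(context.Context, map[string]string) error { return errors.New("node unreachable") },
	)
	fmt.Println(err, st.meta) // store rolled back to map[env:prod]
}
```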
```go
	return nil, status.Errorf(codes.Internal, "error resuming sandbox after checkpoint: %s", err)
}

// ResumeSandbox seeds apiMetadata from the (immutable) APIStoredConfig
// snapshot — override with the live value so any PATCH carries over.
resumedSbx.SetAPIMetadata(sbx.GetAPIMetadata())

// …

// Collect prefetch data immediately after resume while it's most accurate
prefetchData, prefetchErr := resumedSbx.MemoryPrefetchData(ctx)
```
🔴 The Checkpoint handler calls resumedSbx.SetAPIMetadata(sbx.GetAPIMetadata()) at line 648 after ResumeSandbox returns, but ResumeSandbox calls MarkRunning internally as its final step — making resumedSbx immediately discoverable via Sandboxes.Get() before the metadata override is applied. A concurrent PATCH gRPC Update call that lands in this window will apply the user's patched metadata to resumedSbx, which the Checkpoint handler then silently overwrites with the old sandbox's pre-patch metadata, causing Redis (already updated by sandboxStore.Update) and the node to diverge.
Extended reasoning...
What the bug is and how it manifests
Inside ResumeSandbox (sandbox.go:933), f.Sandboxes.MarkRunning(ctx, sbx) is the last substantive call before return sbx, nil. The moment MarkRunning completes, Map.Get() (map.go:81-88) will return resumedSbx for any caller because IsRunning() is now true. The Checkpoint handler at sandboxes.go:625 then waits for ResumeSandbox to return, and only afterward calls resumedSbx.SetAPIMetadata(sbx.GetAPIMetadata()) at line 648. There is a real, non-zero window between those two events.
The specific code path that triggers it
1. Checkpoint handler calls `ResumeSandbox(..., sbx.APIStoredConfig)` (line 625).
2. Deep inside `ResumeSandbox`, line 933 executes `f.Sandboxes.MarkRunning(ctx, sbx)` — `resumedSbx` is now visible to concurrent `Get()` callers. Two goroutines are spawned immediately after, creating Go scheduling points where other goroutines can run.
3. `ResumeSandbox` returns to the Checkpoint handler.
4. The Checkpoint handler executes `resumedSbx.SetAPIMetadata(sbx.GetAPIMetadata())` (line 648) — note: `sbx` is the OLD sandbox; its `GetAPIMetadata()` reflects pre-PATCH state.
Why existing code does not prevent it
The Update gRPC handler (sandboxes.go:336-347) correctly saves and restores metadata via an ApplyAllOrNone closure, so the node-side rollback is safe. However, that rollback only protects the node state. The API layer's sandboxStore.Update (Redis) is committed before the gRPC call in patchSandboxMetadataOnNode, meaning Redis already holds the post-PATCH value when the race occurs. The rwmu lock inside SetAPIMetadata only prevents torn writes; it does nothing to prevent the Checkpoint handler from overwriting a legitimately applied PATCH.
What the impact would be
Consider: sbx.apiMetadata = {"env": "staging"} (the live value after a PATCH), while sbx.APIStoredConfig.Metadata = {"env": "prod"} (the immutable create-time snapshot). After the race:
- Redis holds `{"env": "staging"}` (from `sandboxStore.Update`)
- `resumedSbx.apiMetadata` is reset to `{"env": "prod"}` by the Checkpoint handler
Any subsequent GetAPIMetadata() call on the resumed sandbox returns stale data. The List RPC ships the wrong value, event telemetry records the wrong tags, and if the sandbox is checkpointed again the stale value propagates further.
How to fix it
Option A (preferred — atomic): Pass the live metadata into ResumeSandbox as a parameter and seed apiMetadata from it rather than from apiConfigToStore.GetMetadata(). This ensures the correct value is in place before MarkRunning makes the sandbox visible.
Option B (surgical): Extract MarkRunning out of ResumeSandbox and call it in the Checkpoint handler after resumedSbx.SetAPIMetadata(sbx.GetAPIMetadata()). This keeps the existing structure but requires callers to remember to call MarkRunning themselves.
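A rough sketch of Option A's ordering, with stand-in types (the names and signatures are guesses for illustration, not the repo's real ResumeSandbox):

```go
package main

import (
	"fmt"
	"sync"
)

// Minimal stand-ins for the real types — only the ordering matters here.
type Sandbox struct {
	mu          sync.RWMutex
	apiMetadata map[string]string
}

func (s *Sandbox) SetAPIMetadata(m map[string]string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.apiMetadata = m
}

type SandboxMap struct {
	mu        sync.Mutex
	sandboxes map[string]*Sandbox
}

// MarkRunning publishes the sandbox; from this point concurrent Update RPCs
// (and the PATCH calls behind them) can see and mutate it.
func (m *SandboxMap) MarkRunning(id string, s *Sandbox) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.sandboxes[id] = s
}

// resumeSandbox (Option A): seed apiMetadata from the live value *before*
// MarkRunning, so no later override is needed in the Checkpoint handler and
// a concurrent PATCH can never be clobbered by a stale snapshot.
func resumeSandbox(m *SandboxMap, id string, liveMetadata map[string]string) *Sandbox {
	s := &Sandbox{}
	s.SetAPIMetadata(liveMetadata) // not the create-time config snapshot
	m.MarkRunning(id, s)
	return s
}

func main() {
	sandboxes := &SandboxMap{sandboxes: map[string]*Sandbox{}}
	live := map[string]string{"env": "staging"} // value after a PATCH on the old sandbox
	s := resumeSandbox(sandboxes, "sbx-1", live)
	fmt.Println(s.apiMetadata) // map[env:staging]
}
```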
Step-by-step proof of divergence
1. Sandbox is created with metadata `{"env": "prod"}`; both Redis and `apiMetadata` hold this value.
2. Client calls `PATCH /sandboxes/{id}/metadata` with `{"env": "staging"}`. `sandboxStore.Update` commits `{"env": "staging"}` to Redis.
3. gRPC `Update` arrives at the node; `sbx.SetAPIMetadata({"env": "staging"})` sets the live value.
4. Concurrently, client calls `POST /sandboxes/{id}/checkpoints`.
5. Checkpoint handler calls `ResumeSandbox`; inside, `MarkRunning` fires — `resumedSbx` is now publicly visible.
6. Before the Checkpoint handler reaches line 648, a second `Update` gRPC call (e.g., another concurrent PATCH) arrives and calls `resumedSbx.SetAPIMetadata({"env": "staging"})`.
7. Checkpoint handler executes `resumedSbx.SetAPIMetadata(sbx.GetAPIMetadata())` — `sbx` still has `{"env": "prod"}` in its create-time snapshot or has not yet been updated.
8. Result: Redis = `{"env": "staging"}`, node `resumedSbx.apiMetadata` = `{"env": "prod"}`. Silent divergence.
dobrac left a comment:
Why don't we do the implementation similarly to the network update?
```go
var err error
sandboxID, err = utils.ShortID(sandboxID)
```

Suggested change:

```go
sandboxID, err := utils.ShortID(sandboxID)
```
```proto
// Live user-facing metadata tags. Authoritative over config.metadata, which
// reflects only the create-time snapshot.
map<string, string> metadata = 5;
```
why do we need this change?
```go
) *api.APIError {
	ctx, span := tracer.Start(ctx, "patch-sandbox-metadata-on-node",
		trace.WithAttributes(
			attribute.String("instance.id", sbx.SandboxID),
```
- `PATCH /sandboxes/{id}/metadata` with merge semantics: string values upsert, `null` or `""` remove, absent keys are left alone. Works only on running sandboxes.
- `RunningSandbox` gets a metadata field so List ships the live value instead of the create-time `APIStoredConfig` snapshot (modeled after startAt/endAt).
- `SandboxUpdateRequest` gains an optional SandboxMetadataUpdate wrapper to distinguish "don't touch" from an explicit patch.
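A minimal sketch of the List-side preference this implies — assuming (my guess, not the actual generated types) that RunningSandbox carries both the config snapshot and the new live field:

```go
package main

import "fmt"

// Illustrative stand-ins for the orchestrator's RunningSandbox message.
type SandboxConfig struct{ Metadata map[string]string } // create-time snapshot
type RunningSandbox struct {
	Config   *SandboxConfig
	Metadata map[string]string // live tags, updated by PATCH
}

// listMetadata prefers the live metadata field and falls back to the config
// snapshot for orchestrators that do not populate the new field yet.
func listMetadata(rs *RunningSandbox) map[string]string {
	if rs.Metadata != nil {
		return rs.Metadata
	}
	return rs.Config.Metadata
}

func main() {
	rs := &RunningSandbox{
		Config:   &SandboxConfig{Metadata: map[string]string{"env": "prod"}},
		Metadata: map[string]string{"env": "staging"},
	}
	fmt.Println(listMetadata(rs)) // map[env:staging]
}
```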