feat(api): add PATCH /sandboxes/{id}/metadata#2464

Draft
levb wants to merge 2 commits into main from lev-update-metadata

Conversation

@levb
Contributor

@levb levb commented Apr 21, 2026

PATCH /sandboxes/{id}/metadata with merge semantics: string values upsert, null or "" remove, absent keys are left alone. Works only on running sandboxes.

RunningSandbox gets a metadata field so List ships the live value instead of the create-time APIStoredConfig snapshot (modeled after startAt/endAt). SandboxUpdateRequest gains an optional SandboxMetadataUpdate wrapper to distinguish "don't touch" from an explicit patch.
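
A minimal sketch of the merge semantics described above, assuming the patch arrives as a map of optional strings. The names `mergeMetadata` and `strPtr` are illustrative and not taken from this PR.

```go
package main

import "fmt"

// mergeMetadata applies patch on top of current and returns a new map.
// A nil pointer or a pointer to "" removes the key, any other value upserts,
// and keys absent from patch are left alone.
func mergeMetadata(current map[string]string, patch map[string]*string) map[string]string {
	merged := make(map[string]string, len(current)+len(patch))
	for k, v := range current {
		merged[k] = v
	}
	for k, v := range patch {
		if v == nil || *v == "" {
			delete(merged, k)
			continue
		}
		merged[k] = *v
	}
	return merged
}

// strPtr is a small helper for building patches in this example.
func strPtr(s string) *string { return &s }

func main() {
	current := map[string]string{"user_id": "u-1", "env": "prod", "tmp": "x"}
	patch := map[string]*string{
		"env": strPtr("staging"), // upsert
		"tmp": nil,               // remove
	}
	// user_id is absent from the patch, so it survives.
	fmt.Println(mergeMetadata(current, patch)) // map[env:staging user_id:u-1]
}
```

Returning a fresh map rather than mutating `current` keeps the caller free to compare old and new values or to roll back.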

@claude claude Bot left a comment

test review

Comment thread packages/orchestrator/pkg/server/sandboxes.go
Comment thread packages/orchestrator/pkg/server/sandboxes.go Outdated
Comment thread spec/openapi.yml Outdated
Replace metadata on a running sandbox. Full-replace semantics —
keys absent from the body are removed. `null` and `{}` both
clear the map. Paused sandboxes and merge semantics deferred.

Wire-up: gRPC SandboxUpdateRequest gains an optional
SandboxMetadataUpdate wrapper so proto3 can distinguish
"don't touch" from "clear" (bare map<string,string> cannot).
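
A rough Go sketch of why the wrapper helps, under the full-replace semantics of this commit. The structs below are hand-written stand-ins, not the generated protobuf types: a pointer-typed wrapper can be nil ("don't touch") or non-nil with an empty map ("clear"), whereas a bare map field cannot express the difference on the wire.

```go
package main

import "fmt"

// MetadataUpdate is a hypothetical stand-in for the SandboxMetadataUpdate wrapper.
type MetadataUpdate struct {
	Metadata map[string]string
}

// UpdateRequest is a hypothetical stand-in for SandboxUpdateRequest.
type UpdateRequest struct {
	// nil means "don't touch"; non-nil with an empty map means "clear".
	Metadata *MetadataUpdate
}

// applyUpdate returns the metadata after applying req with full-replace semantics.
func applyUpdate(current map[string]string, req UpdateRequest) map[string]string {
	if req.Metadata == nil {
		return current // wrapper absent: leave metadata alone
	}
	replaced := make(map[string]string, len(req.Metadata.Metadata))
	for k, v := range req.Metadata.Metadata {
		replaced[k] = v
	}
	return replaced
}

func main() {
	current := map[string]string{"env": "prod"}

	untouched := applyUpdate(current, UpdateRequest{})
	cleared := applyUpdate(current, UpdateRequest{Metadata: &MetadataUpdate{}})

	fmt.Println(len(untouched), len(cleared)) // 1 0
}
```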
@levb levb force-pushed the lev-update-metadata branch from d27a53e to 20eaa6e Compare April 21, 2026 05:57
@levb levb marked this pull request as ready for review April 21, 2026 06:08
@ValentaTomas
Member

Still thinking about PUT vs PATCH: with PUT, people may now need to fetch the previous values just to overwrite part of them, right?

@levb
Contributor Author

levb commented Apr 21, 2026

PATCH has its own issues. We could do both, but seems like client complexity. I am 0/5 since I don't know the use cases. PUT has the advantage of clarity.

@ValentaTomas
Member

As we already don't allow nested values, maybe a targeted PUT or PATCH might work here?

So, for example, something like:

{
  "a": ...,
  "b": ...,
}

PATCH /sandboxes/.../metadata { "metadata": null }

->

{}

{
  "a": 1,
  "b": 2,
}

PATCH /sandboxes/.../metadata { "metadata": { "c": 3 } }

->

{
  "a": 1,
  "b": 2,
  "c": 3,
}

{
  "a": 1,
  "b": 2,
}

PATCH /sandboxes/.../metadata { "metadata": { "a": null } }

->

{
  "b": 2,
}
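
One way the three cases above could be told apart on the server, sketched with the standard `encoding/json` package. The `patchBody` type and `decodePatch` helper are hypothetical, and values are strings here because sandbox metadata is a string map. Decoding into `map[string]*string` keeps a key that was sent as `null` (with a nil pointer) while an absent key is simply missing from the map.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// patchBody is a hypothetical request shape: {"metadata": {...}}.
type patchBody struct {
	Metadata map[string]*string `json:"metadata"`
}

// decodePatch parses a request body and returns the per-key patch.
func decodePatch(raw string) (map[string]*string, error) {
	var body patchBody
	if err := json.Unmarshal([]byte(raw), &body); err != nil {
		return nil, err
	}
	return body.Metadata, nil
}

func main() {
	patch, err := decodePatch(`{"metadata": {"a": null, "c": "3"}}`)
	if err != nil {
		panic(err)
	}
	for _, k := range []string{"a", "b", "c"} {
		v, present := patch[k]
		switch {
		case !present:
			fmt.Printf("%s: absent, leave untouched\n", k)
		case v == nil:
			fmt.Printf("%s: null, remove\n", k)
		default:
			fmt.Printf("%s: set to %q\n", k, *v)
		}
	}
}
```

Note that this sketch does not distinguish a missing `metadata` field from `"metadata": null`; both leave the map nil.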

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 20eaa6ea43

Comment thread packages/orchestrator/pkg/server/sandboxes.go Outdated
@ValentaTomas
Member

I worry here: if users already store things like user_id on their sandboxes, there will be cases where those IDs get wiped out. If they stored more information there, they may not be able to get it from other sources easily, other than by calling get metadata on the sandbox.

Comment thread packages/orchestrator/pkg/sandbox/sandbox.go Outdated
Comment thread packages/api/internal/orchestrator/update_metadata.go Outdated
@levb levb marked this pull request as draft April 21, 2026 14:35
@levb
Contributor Author

levb commented Apr 21, 2026

ok, will convert to PATCH, NP 👍

- PUT → PATCH /sandboxes/{id}/metadata with merge semantics
  (k=v upserts, k=null or "" removes, absent keys untouched).
  Addresses ValentaTomas's concern that full-replace would force
  clients to round-trip existing tags to avoid wiping them.

- Keep APIStoredConfig write-once (it was already deprecated and
  never mutated elsewhere). Add a metadata field to RunningSandbox
  so List ships the live sbx.GetAPIMetadata() instead of the
  create-time snapshot; the API-side nodemanager now reads from
  there. Checkpoint overrides the resumed sandbox's metadata with
  the live value after ResumeSandbox seeds from the stale proto.
  Fixes the claude[bot]/codex P1 about PATCH values reverting on
  resume and being invisible to List reconciles.
@levb levb changed the title feat(api): add PUT /sandboxes/{id}/metadata feat(api): add PATCH /sandboxes/{id}/metadata Apr 21, 2026
@levb levb marked this pull request as ready for review April 21, 2026 16:00
Comment on lines +47 to +59
sbx, err := o.sandboxStore.Update(ctx, teamID, sandboxID, updateFunc)
if err != nil {
	switch {
	case errors.As(err, &sbxNotRunningErr):
		return &api.APIError{Code: http.StatusConflict, ClientMsg: utils.SandboxChangingStateMsg(sandboxID, sbxNotRunningErr.State), Err: err}
	case errors.Is(err, sandbox.ErrNotFound):
		return &api.APIError{Code: http.StatusNotFound, ClientMsg: utils.SandboxNotFoundMsg(sandboxID), Err: err}
	default:
		return &api.APIError{Code: http.StatusInternalServerError, ClientMsg: "Error patching sandbox metadata", Err: err}
	}
}

return o.patchSandboxMetadataOnNode(ctx, sbx, merged)

🔴 The two-phase write in PatchSandboxMetadata (patch_metadata.go lines 47–59) is non-atomic in two distinct failure modes: (1) if patchSandboxMetadataOnNode fails after sandboxStore.Update succeeds, Redis permanently holds the new merged metadata while the node retains the old values — no rollback exists; (2) concurrent PATCH calls serialize at the Redis level but their gRPC calls are completely unserialized, so whichever gRPC arrives at the node last silently wins regardless of Redis write order, leaving Redis and the live sandbox permanently diverged. Fix (1) by capturing old metadata before sandboxStore.Update and issuing a compensating update in the error path; fix (2) by adding per-sandbox mutual exclusion around the (store.Update, patchOnNode) pair, or switching the node RPC to a conditional/versioned write.

Extended reasoning...

What the bug is and how it manifests

PatchSandboxMetadata in packages/api/internal/orchestrator/patch_metadata.go (lines 26–60) performs a two-phase write: it first commits the merged metadata to Redis via sandboxStore.Update (line 47), then propagates the result to the running sandbox on the orchestrator node via gRPC (patchSandboxMetadataOnNode, line 59). This two-step sequence is not atomic at either the failure-recovery or the concurrency level, producing two independent bugs.

Bug 1 – No rollback on gRPC failure (split-brain)

sandboxStore.Update acquires a Redis distributed lock, performs a read-modify-write, then releases the lock before returning. At that point Redis durably holds the new merged metadata. If the subsequent patchSandboxMetadataOnNode gRPC call fails for any reason (node unreachable, timeout, codes.NotFound, network partition), the function returns an error to the caller but there is no compensating sandboxStore.Update on the error path. Redis now permanently reflects the new metadata while the running sandbox has the old metadata. Because the API store is Redis-backed, this divergence survives API restarts. Any subsequent GET /sandboxes/{id} call will return the new (incorrect) metadata from Redis while the sandbox environment runs with the old values. The only recovery path is a subsequent successful end-to-end PATCH, which callers who received an error have no reason to believe is necessary.

Bug 2 – Concurrent PATCH calls silently overwrite each other at the node

The Redis-level sandboxStore.Update callback serializes concurrent writes correctly: each call acquires the distributed lock, reads the current state, computes a merged map, writes it, and releases the lock. However, the critical section ends before the gRPC call. Each call captures its own merged map via closure at the time of its Redis write, then fires the gRPC independently. Because gRPC calls are completely unserialized with respect to each other, whichever call's RPC arrives at the node last wins — regardless of which Redis write was later. The gRPC carries the full resolved map (not a delta), so the earlier call's stale snapshot fully replaces whatever the node received from the later call.

Why existing code does not prevent this

The orchestrator-side Update handler (sandboxes.go:336–347) correctly uses ApplyAllOrNone with compensating closures to protect the node's in-memory apiMetadata, but this cannot reach back to undo an already-committed Redis write. The Reconcile loop only kills orphaned sandboxes; it never pushes Redis metadata back to a running node. There is no per-sandbox mutex or sequencing primitive around the (store.Update, patchOnNode) pair in the API layer.

Step-by-step proof of Bug 1

  1. Client PATCHes {env: staging} on sandbox S. sandboxStore.Update succeeds → Redis: {env: staging}.
  2. patchSandboxMetadataOnNode gRPC fails (timeout). API returns 500.
  3. Redis permanently holds {env: staging}; node still has {env: prod}.
  4. Any GET /sandboxes/S now returns {env: staging} from Redis while the sandbox process runs with {env: prod}.
  5. Divergence persists across API restarts until a successful end-to-end PATCH overwrites both.

Step-by-step proof of Bug 2

Initial metadata: {}. Call A patches {x:1}, call B patches {y:2} concurrently.

  1. A's Redis write: Redis={x:1}, A holds merged={x:1}.
  2. B's Redis write (reads A's result): Redis={x:1,y:2}, B holds merged={x:1,y:2}.
  3. B's gRPC fires first → node apiMetadata={x:1,y:2}.
  4. A's gRPC fires second → node apiMetadata={x:1}.

Final: Redis={x:1,y:2} (correct) vs. node={x:1} (stale). The divergence is silent, returns no error, and persists for the sandbox lifetime.
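
The Bug 2 interleaving can be replayed deterministically without goroutines. This is only a sketch: `state`, `storeUpdate`, and `nodeReplace` are hypothetical in-memory stand-ins for Redis and the node RPC, not code from this PR.

```go
package main

import "fmt"

// state models the two copies of the metadata: the store and the node.
type state struct {
	store map[string]string
	node  map[string]string
}

func clone(m map[string]string) map[string]string {
	out := make(map[string]string, len(m))
	for k, v := range m {
		out[k] = v
	}
	return out
}

// storeUpdate merges patch into the store and returns the snapshot
// that the caller would later send to the node.
func (s *state) storeUpdate(patch map[string]string) map[string]string {
	for k, v := range patch {
		s.store[k] = v
	}
	return clone(s.store)
}

// nodeReplace models the RPC, which carries the full resolved map, not a delta.
func (s *state) nodeReplace(snapshot map[string]string) {
	s.node = clone(snapshot)
}

// replay runs the store writes in order A, B but delivers the RPCs in order B, A.
func replay() (store, node map[string]string) {
	s := &state{store: map[string]string{}, node: map[string]string{}}
	a := s.storeUpdate(map[string]string{"x": "1"})
	b := s.storeUpdate(map[string]string{"y": "2"})
	s.nodeReplace(b) // B's RPC arrives first
	s.nodeReplace(a) // A's stale snapshot arrives last and wins
	return s.store, s.node
}

func main() {
	store, node := replay()
	fmt.Println("store:", store) // store: map[x:1 y:2]
	fmt.Println("node: ", node)  // node:  map[x:1]
}
```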

How to fix

Bug 1: Capture old metadata before calling sandboxStore.Update and issue a compensating sandboxStore.Update restoring the old value in the patchSandboxMetadataOnNode error path. Alternatively, reverse the order: call patchSandboxMetadataOnNode first (already rollback-safe on the node side) and only commit to Redis on success. Bug 2: Wrap the (sandboxStore.Update, patchSandboxMetadataOnNode) pair in a per-sandbox mutex or keyed lock so concurrent PATCH calls for the same sandbox are serialized end-to-end.
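
A minimal sketch combining both suggestions: a per-sandbox keyed lock around the pair, and the node-first ordering so the store is only written after the node accepts the change. Everything here (`keyedMutex`, `memStore`, `patcher`, `errNodeDown`) is hypothetical. An in-process mutex also only serializes calls within one API instance; with several instances a distributed lock or a versioned write would be needed instead.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errNodeDown = errors.New("node unreachable")

// keyedMutex hands out one mutex per sandbox ID.
// Entries are never evicted in this sketch.
type keyedMutex struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func (k *keyedMutex) lock(id string) (unlock func()) {
	k.mu.Lock()
	if k.locks == nil {
		k.locks = make(map[string]*sync.Mutex)
	}
	l, ok := k.locks[id]
	if !ok {
		l = &sync.Mutex{}
		k.locks[id] = l
	}
	k.mu.Unlock()

	l.Lock()
	return l.Unlock
}

// memStore is an in-memory stand-in for the Redis-backed store.
type memStore struct {
	mu   sync.Mutex
	data map[string]map[string]string
}

// get returns a copy of the stored metadata for id.
func (s *memStore) get(id string) map[string]string {
	s.mu.Lock()
	defer s.mu.Unlock()
	out := make(map[string]string, len(s.data[id]))
	for k, v := range s.data[id] {
		out[k] = v
	}
	return out
}

func (s *memStore) set(id string, md map[string]string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.data == nil {
		s.data = make(map[string]map[string]string)
	}
	s.data[id] = md
}

type patcher struct {
	locks      keyedMutex
	store      *memStore
	pushToNode func(id string, md map[string]string) error // stand-in for the gRPC call
}

// patch serializes the whole (node, store) pair per sandbox.
func (p *patcher) patch(id string, delta map[string]string) error {
	unlock := p.locks.lock(id)
	defer unlock()

	merged := p.store.get(id)
	for k, v := range delta {
		merged[k] = v
	}
	// Node first: if this fails, the store has not been modified.
	if err := p.pushToNode(id, merged); err != nil {
		return fmt.Errorf("push to node: %w", err)
	}
	p.store.set(id, merged)
	return nil
}

func main() {
	nodeUp := false
	p := &patcher{
		store: &memStore{},
		pushToNode: func(id string, md map[string]string) error {
			if !nodeUp {
				return errNodeDown
			}
			return nil
		},
	}
	fmt.Println(p.patch("sbx-1", map[string]string{"env": "staging"}), p.store.get("sbx-1"))
	nodeUp = true
	fmt.Println(p.patch("sbx-1", map[string]string{"env": "staging"}), p.store.get("sbx-1"))
}
```

The node-first ordering leaves one remaining gap: if the store write fails after the node accepted the change, the node would need a compensating call, which this sketch omits.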

Comment on lines 643 to 651

return nil, status.Errorf(codes.Internal, "error resuming sandbox after checkpoint: %s", err)
}
// ResumeSandbox seeds apiMetadata from the (immutable) APIStoredConfig
// snapshot — override with the live value so any PATCH carries over.
resumedSbx.SetAPIMetadata(sbx.GetAPIMetadata())

// Collect prefetch data immediately after resume while it's most accurate
prefetchData, prefetchErr := resumedSbx.MemoryPrefetchData(ctx)

🔴 The Checkpoint handler calls resumedSbx.SetAPIMetadata(sbx.GetAPIMetadata()) at line 648 after ResumeSandbox returns, but ResumeSandbox calls MarkRunning internally as its final step — making resumedSbx immediately discoverable via Sandboxes.Get() before the metadata override is applied. A concurrent PATCH gRPC Update call that lands in this window will apply the user's patched metadata to resumedSbx, which the Checkpoint handler then silently overwrites with the old sandbox's pre-patch metadata, causing Redis (already updated by sandboxStore.Update) and the node to diverge.

Extended reasoning...

What the bug is and how it manifests

Inside ResumeSandbox (sandbox.go:933), f.Sandboxes.MarkRunning(ctx, sbx) is the last substantive call before return sbx, nil. The moment MarkRunning completes, Map.Get() (map.go:81-88) will return resumedSbx for any caller because IsRunning() is now true. The Checkpoint handler at sandboxes.go:625 then waits for ResumeSandbox to return, and only afterward calls resumedSbx.SetAPIMetadata(sbx.GetAPIMetadata()) at line 648. There is a real, non-zero window between those two events.

The specific code path that triggers it

  1. Checkpoint handler calls ResumeSandbox(..., sbx.APIStoredConfig) (line 625).
  2. Deep inside ResumeSandbox, line 933 executes f.Sandboxes.MarkRunning(ctx, sbx); resumedSbx is now visible to concurrent Get() callers. Two goroutines are spawned immediately after, creating Go scheduling points where other goroutines can run.
  3. ResumeSandbox returns to the Checkpoint handler.
  4. The Checkpoint handler executes resumedSbx.SetAPIMetadata(sbx.GetAPIMetadata()) (line 648) — note: sbx is the OLD sandbox; its GetAPIMetadata() reflects pre-PATCH state.

Why existing code does not prevent it

The Update gRPC handler (sandboxes.go:336-347) correctly saves and restores metadata via an ApplyAllOrNone closure, so the node-side rollback is safe. However, that rollback only protects the node state. The API layer's sandboxStore.Update (Redis) is committed before the gRPC call in patchSandboxMetadataOnNode, meaning Redis already holds the post-PATCH value when the race occurs. The rwmu lock inside SetAPIMetadata only prevents torn writes; it does nothing to prevent the Checkpoint handler from overwriting a legitimately applied PATCH.

What the impact would be

Consider: sbx.apiMetadata = {"env": "staging"} (the live value after a PATCH), while sbx.APIStoredConfig.Metadata = {"env": "prod"} (the immutable create-time snapshot). After the race:

  • Redis holds {"env": "staging"} (from sandboxStore.Update)
  • resumedSbx.apiMetadata is reset to {"env": "prod"} by the Checkpoint handler

Any subsequent GetAPIMetadata() call on the resumed sandbox returns stale data. The List RPC ships the wrong value, event telemetry records the wrong tags, and if the sandbox is checkpointed again the stale value propagates further.

How to fix it

Option A (preferred — atomic): Pass the live metadata into ResumeSandbox as a parameter and seed apiMetadata from it rather than from apiConfigToStore.GetMetadata(). This ensures the correct value is in place before MarkRunning makes the sandbox visible.

Option B (surgical): Extract MarkRunning out of ResumeSandbox and call it in the Checkpoint handler after resumedSbx.SetAPIMetadata(sbx.GetAPIMetadata()). This keeps the existing structure but requires callers to remember to call MarkRunning themselves.
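
Option A could look roughly like the following. This is a sketch under assumptions: `sandbox`, `registry`, and `resume` are simplified stand-ins for the real types and for ResumeSandbox, and only the ordering matters, namely that the live metadata is seeded before the sandbox is published.

```go
package main

import (
	"fmt"
	"sync"
)

// sandbox is a simplified stand-in holding only the live metadata.
type sandbox struct {
	mu       sync.RWMutex
	metadata map[string]string
}

// GetAPIMetadata returns a copy of the live metadata.
func (s *sandbox) GetAPIMetadata() map[string]string {
	s.mu.RLock()
	defer s.mu.RUnlock()
	out := make(map[string]string, len(s.metadata))
	for k, v := range s.metadata {
		out[k] = v
	}
	return out
}

// registry is a simplified stand-in for the map of running sandboxes.
type registry struct {
	mu      sync.RWMutex
	running map[string]*sandbox
}

func (r *registry) markRunning(id string, s *sandbox) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.running == nil {
		r.running = make(map[string]*sandbox)
	}
	r.running[id] = s
}

func (r *registry) get(id string) (*sandbox, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	s, ok := r.running[id]
	return s, ok
}

// resume seeds the new sandbox from the live metadata passed in by the caller,
// then publishes it as the last step, so no reader can observe a stale value.
func resume(r *registry, id string, live map[string]string) *sandbox {
	seeded := make(map[string]string, len(live))
	for k, v := range live {
		seeded[k] = v
	}
	sbx := &sandbox{metadata: seeded}
	r.markRunning(id, sbx)
	return sbx
}

func main() {
	r := &registry{}
	old := &sandbox{metadata: map[string]string{"env": "staging"}}

	resume(r, "sbx-1", old.GetAPIMetadata())

	if got, ok := r.get("sbx-1"); ok {
		fmt.Println(got.GetAPIMetadata()) // map[env:staging]
	}
}
```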

Step-by-step proof of divergence

  1. Sandbox is created with metadata {"env": "prod"}; both Redis and apiMetadata hold this value.
  2. Client calls PATCH /sandboxes/{id}/metadata with {"env": "staging"}.
  3. sandboxStore.Update commits {"env": "staging"} to Redis.
  4. gRPC Update arrives at the node; sbx.SetAPIMetadata({"env": "staging"}) sets the live value.
  5. Concurrently, client calls POST /sandboxes/{id}/checkpoints.
  6. Checkpoint handler calls ResumeSandbox; inside, MarkRunning fires — resumedSbx is now publicly visible.
  7. Before the Checkpoint handler reaches line 648, a second Update gRPC call (e.g., another concurrent PATCH) arrives and calls resumedSbx.SetAPIMetadata({"env": "staging"}).
  8. Checkpoint handler executes resumedSbx.SetAPIMetadata(sbx.GetAPIMetadata()); sbx still has {"env": "prod"} in its create-time snapshot or has not yet been updated.
  9. Result: Redis = {"env": "staging"}, node resumedSbx.apiMetadata = {"env": "prod"}. Silent divergence.

Contributor

@dobrac dobrac left a comment

Why don't we implement this the same way as the network update?

Comment on lines +22 to +23
var err error
sandboxID, err = utils.ShortID(sandboxID)

Suggested change
- var err error
- sandboxID, err = utils.ShortID(sandboxID)
+ sandboxID, err := utils.ShortID(sandboxID)

Comment on lines +135 to +138

// Live user-facing metadata tags. Authoritative over config.metadata, which
// reflects only the create-time snapshot.
map<string, string> metadata = 5;

Why do we need this change?

) *api.APIError {
ctx, span := tracer.Start(ctx, "patch-sandbox-metadata-on-node",
trace.WithAttributes(
attribute.String("instance.id", sbx.SandboxID),

telemetry.WithSandboxID

@dobrac dobrac assigned dobrac and unassigned ValentaTomas Apr 21, 2026
@dobrac dobrac added the feature New feature label Apr 21, 2026
@levb levb marked this pull request as draft April 21, 2026 19:30
Labels

feature New feature

3 participants