Conversation
Replace metadata on a running sandbox. Full-replace semantics —
keys absent from the body are removed. `null` and `{}` both
clear the map. Paused sandboxes and merge semantics deferred.
Wire-up: gRPC SandboxUpdateRequest gains an optional
SandboxMetadataUpdate wrapper so proto3 can distinguish
"don't touch" from "clear" (bare map<string,string> cannot).
d27a53e to 20eaa6e
Still thinking about PUT vs PATCH — it is possible that people will need to fetch the previous values just to overwrite part of them now, right?

As we already don't allow nested values, maybe a targeted PUT or PATCH might work here? So it would be, for example, something like:
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 20eaa6ea43
I worry here, because if users already have things like user_id on the sandboxes, there will be cases where these ids get wiped out, and if they stored more info there, they may not be able to recover it from other sources easily, apart from calling get metadata on the sandbox.

ok, will convert to PATCH, NP 👍
- PUT → PATCH /sandboxes/{id}/metadata with merge semantics
  (k=v upserts, k=null or "" removes, absent keys untouched);
  see the merge sketch after this list. Addresses ValentaTomas's
  concern that full-replace would force clients to round-trip
  existing tags to avoid wiping them.
- Keep APIStoredConfig write-once (it was already deprecated and
never mutated elsewhere). Add a metadata field to RunningSandbox
so List ships the live sbx.GetAPIMetadata() instead of the
create-time snapshot; the API-side nodemanager now reads from
there. Checkpoint overrides the resumed sandbox's metadata with
the live value after ResumeSandbox seeds from the stale proto.
Fixes the claude[bot]/codex P1 about PATCH values reverting on
resume and being invisible to List reconciles.
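A minimal sketch of the merge semantics described in the first bullet (my illustration — the helper name and the use of "" to stand in for JSON null are assumptions, not the PR's actual code):

```go
package main

import "fmt"

// mergeMetadata applies the PATCH merge: a non-empty value upserts, an empty
// string (standing in for JSON null on the wire) removes the key, and keys
// absent from the patch are left untouched.
func mergeMetadata(current, patch map[string]string) map[string]string {
	merged := make(map[string]string, len(current))
	for k, v := range current {
		merged[k] = v
	}
	for k, v := range patch {
		if v == "" {
			delete(merged, k) // "" / null removes the key
			continue
		}
		merged[k] = v // upsert
	}
	return merged
}

func main() {
	current := map[string]string{"env": "prod", "user_id": "42"}
	patch := map[string]string{"env": "staging", "user_id": ""}
	fmt.Println(mergeMetadata(current, patch)) // map[env:staging]
}
```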
```go
sbx, err := o.sandboxStore.Update(ctx, teamID, sandboxID, updateFunc)
if err != nil {
	switch {
	case errors.As(err, &sbxNotRunningErr):
		return &api.APIError{Code: http.StatusConflict, ClientMsg: utils.SandboxChangingStateMsg(sandboxID, sbxNotRunningErr.State), Err: err}
	case errors.Is(err, sandbox.ErrNotFound):
		return &api.APIError{Code: http.StatusNotFound, ClientMsg: utils.SandboxNotFoundMsg(sandboxID), Err: err}
	default:
		return &api.APIError{Code: http.StatusInternalServerError, ClientMsg: "Error patching sandbox metadata", Err: err}
	}
}

// …

return o.patchSandboxMetadataOnNode(ctx, sbx, merged)
```
🔴 The two-phase write in PatchSandboxMetadata (patch_metadata.go lines 47–59) is non-atomic in two distinct failure modes: (1) if patchSandboxMetadataOnNode fails after sandboxStore.Update succeeds, Redis permanently holds the new merged metadata while the node retains the old values — no rollback exists; (2) concurrent PATCH calls serialize at the Redis level but their gRPC calls are completely unserialized, so whichever gRPC arrives at the node last silently wins regardless of Redis write order, leaving Redis and the live sandbox permanently diverged. Fix (1) by capturing old metadata before sandboxStore.Update and issuing a compensating update in the error path; fix (2) by adding per-sandbox mutual exclusion around the (store.Update, patchOnNode) pair, or switching the node RPC to a conditional/versioned write.
Extended reasoning...
What the bug is and how it manifests
PatchSandboxMetadata in packages/api/internal/orchestrator/patch_metadata.go (lines 26–60) performs a two-phase write: it first commits the merged metadata to Redis via sandboxStore.Update (line 47), then propagates the result to the running sandbox on the orchestrator node via gRPC (patchSandboxMetadataOnNode, line 59). This two-step sequence is not atomic at either the failure-recovery or the concurrency level, producing two independent bugs.
Bug 1 – No rollback on gRPC failure (split-brain)
sandboxStore.Update acquires a Redis distributed lock, performs a read-modify-write, then releases the lock before returning. At that point Redis durably holds the new merged metadata. If the subsequent patchSandboxMetadataOnNode gRPC call fails for any reason (node unreachable, timeout, codes.NotFound, network partition), the function returns an error to the caller but there is no compensating sandboxStore.Update on the error path. Redis now permanently reflects the new metadata while the running sandbox has the old metadata. Because the API store is Redis-backed, this divergence survives API restarts. Any subsequent GET /sandboxes/{id} call will return the new (incorrect) metadata from Redis while the sandbox environment runs with the old values. The only recovery path is a subsequent successful end-to-end PATCH, which callers who received an error have no reason to believe is necessary.
Bug 2 – Concurrent PATCH calls silently overwrite each other at the node
The Redis-level sandboxStore.Update callback serializes concurrent writes correctly: each call acquires the distributed lock, reads the current state, computes a merged map, writes it, and releases the lock. However, the critical section ends before the gRPC call. Each call captures its own merged map via closure at the time of its Redis write, then fires the gRPC independently. Because gRPC calls are completely unserialized with respect to each other, whichever call's RPC arrives at the node last wins — regardless of which Redis write was later. The gRPC carries the full resolved map (not a delta), so the earlier call's stale snapshot fully replaces whatever the node received from the later call.
Why existing code does not prevent this
The orchestrator-side Update handler (sandboxes.go:336–347) correctly uses ApplyAllOrNone with compensating closures to protect the node's in-memory apiMetadata, but this cannot reach back to undo an already-committed Redis write. The Reconcile loop only kills orphaned sandboxes; it never pushes Redis metadata back to a running node. There is no per-sandbox mutex or sequencing primitive around the (store.Update, patchOnNode) pair in the API layer.
Step-by-step proof of Bug 1
1. Client PATCHes `{env: staging}` on sandbox S. `sandboxStore.Update` succeeds → Redis: `{env: staging}`.
2. `patchSandboxMetadataOnNode` gRPC fails (timeout). API returns 500.
3. Redis permanently holds `{env: staging}`; node still has `{env: prod}`.
4. Any `GET /sandboxes/S` now returns `{env: staging}` from Redis while the sandbox process runs with `{env: prod}`.
5. Divergence persists across API restarts until a successful end-to-end PATCH overwrites both.
Step-by-step proof of Bug 2
Initial metadata: `{}`. Call A patches `{x:1}`, call B patches `{y:2}` concurrently.
1. A's Redis write: Redis=`{x:1}`, A holds merged=`{x:1}`.
2. B's Redis write (reads A's result): Redis=`{x:1,y:2}`, B holds merged=`{x:1,y:2}`.
3. B's gRPC fires first → node apiMetadata=`{x:1,y:2}`.
4. A's gRPC fires second → node apiMetadata=`{x:1}`.

Final: Redis=`{x:1,y:2}` (correct) vs. node=`{x:1}` (stale). The divergence is silent, returns no error, and persists for the sandbox lifetime.
How to fix
Bug 1: Capture old metadata before calling sandboxStore.Update and issue a compensating sandboxStore.Update restoring the old value in the patchSandboxMetadataOnNode error path. Alternatively, reverse the order: call patchSandboxMetadataOnNode first (already rollback-safe on the node side) and only commit to Redis on success. Bug 2: Wrap the (sandboxStore.Update, patchSandboxMetadataOnNode) pair in a per-sandbox mutex or keyed lock so concurrent PATCH calls for the same sandbox are serialized end-to-end.
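A compact sketch of both fixes together — per-sandbox serialization plus a compensating store write — using stand-in types (the keyed lock, store, and function names are illustrative, not the repo's actual APIs):

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// Illustrative stand-ins only — not the repo's real sandboxStore or gRPC client.
type fakeStore struct {
	mu   sync.Mutex
	meta map[string]string
}

func (s *fakeStore) update(fn func(map[string]string) map[string]string) map[string]string {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.meta = fn(s.meta)
	return s.meta
}

// sandboxID -> *sync.Mutex: per-sandbox end-to-end serialization (fix for Bug 2).
var sandboxLocks sync.Map

func patchMetadata(ctx context.Context, st *fakeStore, sandboxID string,
	merge func(map[string]string) map[string]string,
	pushToNode func(context.Context, map[string]string) error,
) error {
	l, _ := sandboxLocks.LoadOrStore(sandboxID, &sync.Mutex{})
	mu := l.(*sync.Mutex)
	mu.Lock()
	defer mu.Unlock()

	var old map[string]string
	merged := st.update(func(cur map[string]string) map[string]string {
		old = cur // remember the pre-patch value for a compensating write (Bug 1)
		return merge(cur)
	})

	if err := pushToNode(ctx, merged); err != nil {
		// Roll the store back so it does not diverge from the node for good.
		st.update(func(map[string]string) map[string]string { return old })
		return fmt.Errorf("patching metadata on node: %w", err)
	}
	return nil
}

func main() {
	st := &fakeStore{meta: map[string]string{"env": "prod"}}
	err := patchMetadata(context.Background(), st, "sbx-1",
		func(cur map[string]string) map[string]string { return map[string]string{"env": "staging"} },
		func(context.Context, map[string]string) error { return errors.New("node unreachable") },
	)
	fmt.Println(err, st.meta) // store rolled back to map[env:prod]
}
```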
```go
	return nil, status.Errorf(codes.Internal, "error resuming sandbox after checkpoint: %s", err)
}

// ResumeSandbox seeds apiMetadata from the (immutable) APIStoredConfig
// snapshot — override with the live value so any PATCH carries over.
resumedSbx.SetAPIMetadata(sbx.GetAPIMetadata())

// …

// Collect prefetch data immediately after resume while it's most accurate
prefetchData, prefetchErr := resumedSbx.MemoryPrefetchData(ctx)
```
🔴 The Checkpoint handler calls resumedSbx.SetAPIMetadata(sbx.GetAPIMetadata()) at line 648 after ResumeSandbox returns, but ResumeSandbox calls MarkRunning internally as its final step — making resumedSbx immediately discoverable via Sandboxes.Get() before the metadata override is applied. A concurrent PATCH gRPC Update call that lands in this window will apply the user's patched metadata to resumedSbx, which the Checkpoint handler then silently overwrites with the old sandbox's pre-patch metadata, causing Redis (already updated by sandboxStore.Update) and the node to diverge.
Extended reasoning...
What the bug is and how it manifests
Inside ResumeSandbox (sandbox.go:933), f.Sandboxes.MarkRunning(ctx, sbx) is the last substantive call before return sbx, nil. The moment MarkRunning completes, Map.Get() (map.go:81-88) will return resumedSbx for any caller because IsRunning() is now true. The Checkpoint handler at sandboxes.go:625 then waits for ResumeSandbox to return, and only afterward calls resumedSbx.SetAPIMetadata(sbx.GetAPIMetadata()) at line 648. There is a real, non-zero window between those two events.
The specific code path that triggers it
1. Checkpoint handler calls `ResumeSandbox(..., sbx.APIStoredConfig)` (line 625).
2. Deep inside `ResumeSandbox`, line 933 executes `f.Sandboxes.MarkRunning(ctx, sbx)` — `resumedSbx` is now visible to concurrent `Get()` callers. Two goroutines are spawned immediately after, creating Go scheduling points where other goroutines can run.
3. `ResumeSandbox` returns to the Checkpoint handler.
4. The Checkpoint handler executes `resumedSbx.SetAPIMetadata(sbx.GetAPIMetadata())` (line 648) — note: `sbx` is the OLD sandbox; its `GetAPIMetadata()` reflects pre-PATCH state.
Why existing code does not prevent it
The Update gRPC handler (sandboxes.go:336-347) correctly saves and restores metadata via an ApplyAllOrNone closure, so the node-side rollback is safe. However, that rollback only protects the node state. The API layer's sandboxStore.Update (Redis) is committed before the gRPC call in patchSandboxMetadataOnNode, meaning Redis already holds the post-PATCH value when the race occurs. The rwmu lock inside SetAPIMetadata only prevents torn writes; it does nothing to prevent the Checkpoint handler from overwriting a legitimately applied PATCH.
What the impact would be
Consider: sbx.apiMetadata = {"env": "staging"} (the live value after a PATCH), while sbx.APIStoredConfig.Metadata = {"env": "prod"} (the immutable create-time snapshot). After the race:
- Redis holds `{"env": "staging"}` (from `sandboxStore.Update`)
- `resumedSbx.apiMetadata` is reset to `{"env": "prod"}` by the Checkpoint handler
Any subsequent GetAPIMetadata() call on the resumed sandbox returns stale data. The List RPC ships the wrong value, event telemetry records the wrong tags, and if the sandbox is checkpointed again the stale value propagates further.
How to fix it
Option A (preferred — atomic): Pass the live metadata into ResumeSandbox as a parameter and seed apiMetadata from it rather than from apiConfigToStore.GetMetadata(). This ensures the correct value is in place before MarkRunning makes the sandbox visible.
Option B (surgical): Extract MarkRunning out of ResumeSandbox and call it in the Checkpoint handler after resumedSbx.SetAPIMetadata(sbx.GetAPIMetadata()). This keeps the existing structure but requires callers to remember to call MarkRunning themselves.
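A rough sketch of Option A's ordering, with stand-in types (the names and signatures are guesses for illustration, not the repo's real ResumeSandbox):

```go
package main

import (
	"fmt"
	"sync"
)

// Minimal stand-ins for the real types — only the ordering matters here.
type Sandbox struct {
	mu          sync.RWMutex
	apiMetadata map[string]string
}

func (s *Sandbox) SetAPIMetadata(m map[string]string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.apiMetadata = m
}

type SandboxMap struct {
	mu        sync.Mutex
	sandboxes map[string]*Sandbox
}

// MarkRunning publishes the sandbox; from this point concurrent Update RPCs
// (and the PATCH calls behind them) can see and mutate it.
func (m *SandboxMap) MarkRunning(id string, s *Sandbox) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.sandboxes[id] = s
}

// resumeSandbox (Option A): seed apiMetadata from the live value *before*
// MarkRunning, so no later override is needed in the Checkpoint handler and
// a concurrent PATCH can never be clobbered by a stale snapshot.
func resumeSandbox(m *SandboxMap, id string, liveMetadata map[string]string) *Sandbox {
	s := &Sandbox{}
	s.SetAPIMetadata(liveMetadata) // not the create-time config snapshot
	m.MarkRunning(id, s)
	return s
}

func main() {
	sandboxes := &SandboxMap{sandboxes: map[string]*Sandbox{}}
	live := map[string]string{"env": "staging"} // value after a PATCH on the old sandbox
	s := resumeSandbox(sandboxes, "sbx-1", live)
	fmt.Println(s.apiMetadata) // map[env:staging]
}
```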
Step-by-step proof of divergence
1. Sandbox is created with metadata `{"env": "prod"}`; both Redis and `apiMetadata` hold this value.
2. Client calls `PATCH /sandboxes/{id}/metadata` with `{"env": "staging"}`. `sandboxStore.Update` commits `{"env": "staging"}` to Redis.
3. gRPC `Update` arrives at the node; `sbx.SetAPIMetadata({"env": "staging"})` sets the live value.
4. Concurrently, client calls `POST /sandboxes/{id}/checkpoints`.
5. Checkpoint handler calls `ResumeSandbox`; inside, `MarkRunning` fires — `resumedSbx` is now publicly visible.
6. Before the Checkpoint handler reaches line 648, a second `Update` gRPC call (e.g., another concurrent PATCH) arrives and calls `resumedSbx.SetAPIMetadata({"env": "staging"})`.
7. Checkpoint handler executes `resumedSbx.SetAPIMetadata(sbx.GetAPIMetadata())` — `sbx` still has `{"env": "prod"}` in its create-time snapshot or has not yet been updated.
8. Result: Redis = `{"env": "staging"}`, node `resumedSbx.apiMetadata` = `{"env": "prod"}`. Silent divergence.
dobrac left a comment:
Why don't we do the implementation similarly to the network update?
```go
var err error
sandboxID, err = utils.ShortID(sandboxID)
```

Suggested change:

```go
sandboxID, err := utils.ShortID(sandboxID)
```
```proto
// Live user-facing metadata tags. Authoritative over config.metadata, which
// reflects only the create-time snapshot.
map<string, string> metadata = 5;
```
why do we need this change?
```go
) *api.APIError {
	ctx, span := tracer.Start(ctx, "patch-sandbox-metadata-on-node",
		trace.WithAttributes(
			attribute.String("instance.id", sbx.SandboxID),
```
- `PATCH /sandboxes/{id}/metadata` with merge semantics: string values upsert, `null` or `""` remove, absent keys are left alone. Works only on running sandboxes.
- `RunningSandbox` gets a metadata field so List ships the live value instead of the create-time `APIStoredConfig` snapshot (modeled after startAt/endAt).
- `SandboxUpdateRequest` gains an optional SandboxMetadataUpdate wrapper to distinguish "don't touch" from an explicit patch.
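A minimal sketch of the List-side preference this implies — assuming (my guess, not the actual generated types) that RunningSandbox carries both the config snapshot and the new live field:

```go
package main

import "fmt"

// Illustrative stand-ins for the orchestrator's RunningSandbox message.
type SandboxConfig struct{ Metadata map[string]string } // create-time snapshot
type RunningSandbox struct {
	Config   *SandboxConfig
	Metadata map[string]string // live tags, updated by PATCH
}

// listMetadata prefers the live metadata field and falls back to the config
// snapshot for orchestrators that do not populate the new field yet.
func listMetadata(rs *RunningSandbox) map[string]string {
	if rs.Metadata != nil {
		return rs.Metadata
	}
	return rs.Config.Metadata
}

func main() {
	rs := &RunningSandbox{
		Config:   &SandboxConfig{Metadata: map[string]string{"env": "prod"}},
		Metadata: map[string]string{"env": "staging"},
	}
	fmt.Println(listMetadata(rs)) // map[env:staging]
}
```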