feat(sandbox): pre-pause guest reclaim via envd #2551

Draft
ValentaTomas wants to merge 6 commits into main from feat/sandbox-pause-reclaim

@ValentaTomas ValentaTomas commented May 4, 2026

Adds an opt-in pre-pause step that runs sync, drop_caches, compact_memory, and fstrim -av on the live VM via envd's Process service to shrink the memfile/rootfs diff. Each step is wrapped in timeout --foreground -s KILL, so a stuck step (most realistically a slow sync on a large dirty backlog) cannot starve the rest — compact_memory always runs as long as its own cap is > 0.

Pausing FC is unaffected by an in-flight guest sync we time out: FC only drains in-flight virtio I/O before completing the pause; any unflushed dirty pages stay in the memfile snapshot and converge on resume. Per-step timeouts trade reclaim payoff, never correctness.

Disabled by default — every per-step cap defaults to 0, so the chain is empty until an operator opts in step by step. The orchestrator skips the envd call entirely when the chain is empty. The outer Connect-Timeout-Ms is derived from the sum of per-step caps plus a small slack.

LD flags (all int, ms; 0 skips that step):

  • reclaim-sync-timeout-ms
  • reclaim-drop-caches-timeout-ms
  • reclaim-compact-memory-timeout-ms
  • reclaim-fstrim-timeout-ms

Pairs cleanly with #2553 (disable proactive compaction in the guest base image), but is independent of it and of FPH (#2552). Split out from #2550.

Run sync, drop_caches, compact_memory, and fstrim -av on the live VM
through envd's Process service immediately before pause to shrink the
memfile/rootfs diff snapshot. Composed as a single bash chain with
';' separators so each step is best-effort, the orchestrator owns the
deadline via Connect-Timeout-Ms, and all failures are non-fatal.

Gated by reclaim-on-pause-timeout-ms (LD int flag, ms; default 0 =
disabled). resume-build gains a matching --reclaim-timeout-ms override
for local exercise.

cursor Bot commented May 4, 2026

PR Summary

Medium Risk
Touches the snapshot/pause path and runs additional guest commands via envd, so misconfigured timeouts or envd/process issues could impact pause latency (though failures are best-effort and non-fatal). Default behavior remains unchanged because all reclaim steps are disabled unless explicitly enabled via flags.

Overview
Adds an opt-in pre-pause “guest reclaim” step that runs sync, drop_caches, compact_memory, and fstrim inside the live VM via envd before snapshotting, with each step individually time-capped by new LaunchDarkly int flags so slow/stuck commands can’t block the pause flow. This centralizes envd Process execution in a new Sandbox.StartEnvdProcess helper and wires feature-flag access into Sandbox, while the resume-build tool gets a -reclaim switch that sets sensible default per-step caps for local runs.

Reviewed by Cursor Bugbot for commit db30f55. Bugbot is set up for automated code reviews on this repo.

Comment thread packages/orchestrator/pkg/sandbox/reclaim.go Outdated
Wraps each reclaim step (sync, drop_caches, compact_memory, fstrim) in
its own `timeout -s KILL`. A stuck step (most realistically a slow sync
on a large dirty backlog) cannot starve the rest, so compact_memory —
the diff-critical step — always runs as long as its cap is > 0.

Per-step ceilings are runtime-configurable via four new IntFlags:
- reclaim-sync-timeout-ms (default 500)
- reclaim-drop-caches-timeout-ms (default 200)
- reclaim-compact-memory-timeout-ms (default 1000)
- reclaim-fstrim-timeout-ms (default 500)

Setting any per-step cap to 0 skips that step. The outer
reclaim-on-pause-timeout-ms remains the master enable + Connect-Timeout-Ms
cap.

Pausing FC is unaffected by an in-flight guest sync that we time out:
FC only drains in-flight virtio I/O before completing the pause; any
unflushed dirty pages stay in the memfile snapshot and converge on
resume. Per-step timeouts trade reclaim payoff, never correctness.
Comment thread packages/orchestrator/pkg/sandbox/reclaim.go Outdated
Comment thread packages/orchestrator/pkg/sandbox/reclaim.go Outdated
…uffix and join

Four fixes triggered by Cursor Bugbot review of the previous commit and
a follow-up question on the master flag:

1. Drop reclaim-on-pause-timeout-ms. Per-step caps (defaulting to 0)
   already encode "disabled by default": when every cap is 0 the script
   is empty and bestEffortReclaim short-circuits without calling envd.
   The outer Connect-Timeout-Ms is now derived from the sum of per-step
   caps + 500ms slack.

2. `timeout` accepts s/m/h/d (or fractional seconds), not `ms`. Format
   each cap as `%.3f` seconds (e.g. 500ms → 0.500). Without this, every
   step would silently fail with "invalid time interval".

3. Join parts with `; ` (not a single space) and append one trailing
   `true`. With space-joining, bash parsed `; true timeout ...` as
   `true` swallowing subsequent steps as args, so only `sync` ever ran.

4. Add `--foreground` to `timeout`. Without it, the SIGKILL doesn't
   reliably reach a stuck child when run from a non-interactive bash
   invoked by envd's Process service (verified empirically with
   `sh -c "sleep 5"` running its full 5s despite a 0.5s timeout).
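
The formatting, joining, and `--foreground` fixes above can be sketched roughly as follows. This is an illustration only, not the code in reclaim.go: the `step` type and `buildReclaimScript` are hypothetical names, and the guest command strings passed in by the caller are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// step pairs a guest command with its cap in milliseconds (0 skips it).
// Hypothetical type used only for this sketch.
type step struct {
	cmd   string
	capMs int
}

// buildReclaimScript wraps every enabled step in its own
// `timeout --foreground -s KILL`, formats the cap as fractional seconds
// (timeout does not accept an "ms" suffix), joins the steps with "; " so
// each one runs regardless of the previous step's exit status, and appends
// one trailing `true`. It returns "" when no step is enabled.
func buildReclaimScript(steps []step) string {
	parts := make([]string, 0, len(steps)+1)
	for _, s := range steps {
		if s.capMs <= 0 {
			continue
		}
		secs := float64(s.capMs) / 1000.0
		parts = append(parts, fmt.Sprintf("timeout --foreground -s KILL %.3f %s", secs, s.cmd))
	}
	if len(parts) == 0 {
		return ""
	}
	parts = append(parts, "true")
	return strings.Join(parts, "; ")
}
```

For example, steps `sync` (500 ms) and `fstrim -av` (500 ms) produce `timeout --foreground -s KILL 0.500 sync; timeout --foreground -s KILL 0.500 fstrim -av; true`.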

resume-build CLI: replace --reclaim-timeout-ms with a `--reclaim` bool
that flips the per-step caps to sane local-test defaults (500/200/1000/500).
The previous refactor commit referenced Sandbox.StartEnvdProcess from
reclaim.go and resume-build/main.go but the helper file itself was
never tracked, breaking the orchestrator build (typecheck) on CI.

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

pc := processconnect.NewProcessClient(&http.Client{Transport: sandboxHttpClient.Transport}, addr)

req := connect.NewRequest(&process.StartRequest{
Process: &process.ProcessConfig{Cmd: "/bin/bash", Args: []string{"-c", script}},

Login shell flag dropped during refactor to shared helper

Medium Severity

The old runCommandInSandbox used Args: []string{"-l", "-c", command} to invoke bash as a login shell, sourcing /etc/profile and user profile scripts. The new shared StartEnvdProcess uses Args: []string{"-c", script}, dropping the -l flag. This means user-provided commands via --cmd, --cmd-pause, or --cmd-signal-pause in the resume-build CLI no longer get a login shell environment, potentially breaking commands that depend on PATH or environment variables set in profile scripts.
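
One possible way to address this, sketched under the assumption that the shared helper can take a flag from its caller: keep `-c` for the reclaim chain and let the resume-build command paths opt back into a login shell. The `bashArgs` function and its `login` parameter are hypothetical and do not appear in the PR.

```go
package main

// bashArgs builds the argument vector passed to /bin/bash.
// With login set, bash runs as a login shell (-l) and sources
// /etc/profile and the user's profile scripts before running the script.
func bashArgs(script string, login bool) []string {
	if login {
		return []string{"-l", "-c", script}
	}
	return []string{"-c", script}
}
```

In this sketch, user-provided commands would call `bashArgs(cmd, true)`, while the reclaim chain would call `bashArgs(script, false)` to avoid the cost of profile scripts on the pause path.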

ValentaTomas added a commit that referenced this pull request May 4, 2026
Adds `vm.compaction_proactiveness=0` to the base template's
`/etc/sysctl.conf` so kcompactd no longer runs background page
migrations in the guest.

With 2 MiB host-side hugepage backing of guest RAM, every migration
dirties a destination hugepage from the host UFFD's perspective and
lands in the next memfile diff — with no snapshot-aligned benefit. The
pre-pause `compact_memory` write (#2551) does the work deterministically
right before we capture state.

Existing templates inherit the change on rebuild.