feat(uffd,fc): balloon free-page-hinting + envd reclaim on pause #2550

Closed
ValentaTomas wants to merge 1 commit into feat/uffd-fc-free-page-reporting-integration from feat/uffd-fc-free-page-hinting-and-reclaim

@ValentaTomas
Member

Adds a pre-pause guest reclaim step (sync + drop_caches + compact_memory + fstrim, run via the existing envd Process service with Connect-Timeout-Ms) and a virtio-balloon free-page-hinting drain that MADV_DONTNEEDs the freed pages out of the memfile before the snapshot.

The balloon is installed with FPH armed whenever FPR is on; both behaviors are off by default and gated at runtime by separate LD flags (free-page-hinting, reclaim-on-pause), so they can be flipped without rebuilding templates.

Pause order: bestEffortReclaim (guest) → DrainBalloon (host-initiated FPH) → PauseSnapshot. Reclaim is best-effort: when Connect-Timeout-Ms expires, envd kills bash, the in-flight kernel write finishes, and the remaining steps are skipped. The FPH drain has its own ~1.5s ceiling and is non-fatal; failures fall through to pause.

Depends on #2541, #2545, #2520.

@cursor

cursor Bot commented May 3, 2026

PR Summary

Medium Risk
Touches the sandbox pause/snapshot path and Firecracker device configuration, which can affect snapshot correctness and pause latency; mitigated by being gated behind new default-off timeout flags and treating failures as non-fatal.

Overview
Adds an optional pre-pause optimization that (when enabled via new timeout-based feature flags) asks envd to run a best-effort guest reclaim script and then triggers a Firecracker virtio-balloon free-page-hinting drain before pausing and snapshotting, aiming to reduce resident memory in snapshots. This includes new Firecracker API plumbing to install the balloon with hinting armed, poll for hinting completion, wiring the feature-flag client into Sandbox, and adding resume-build CLI overrides for the two new timeout flags.

Reviewed by Cursor Bugbot for commit 7efc0d9. Bugbot is set up for automated code reviews on this repo.

@ValentaTomas ValentaTomas force-pushed the feat/uffd-fc-free-page-hinting-and-reclaim branch 4 times, most recently from cfb09ae to 506542b Compare May 3, 2026 23:31
Comment thread packages/orchestrator/pkg/sandbox/fc/client.go
@ValentaTomas ValentaTomas force-pushed the feat/uffd-fc-free-page-hinting-and-reclaim branch 3 times, most recently from 67032c5 to 3fe4149 Compare May 3, 2026 23:41
@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Reviewed by Cursor Bugbot for commit 3fe4149.

// acknowledges or ctx fires. No-op when the balloon wasn't installed.
func (p *Process) DrainBalloon(ctx context.Context) error {
	if !p.balloonInstalled {
		return nil

balloonInstalled never set on resume path breaks DrainBalloon

High Severity

DrainBalloon checks p.balloonInstalled and returns nil if false, but balloonInstalled is only set to true in the Create path (line 450). The Resume path never sets it, even though resumed VMs inherit the balloon device from the snapshot. This means DrainBalloon is a permanent no-op for all resumed sandboxes — which is the primary use case for the FPH drain feature (live sandbox pause via the server, template layer builds via ResumeSandbox, and the resume-build CLI tool).


Adds a pre-pause guest reclaim step (sync + drop_caches + compact_memory +
fstrim, run via the existing envd Process service with Connect-Timeout-Ms)
and a virtio-balloon free-page-hinting drain to MADV_DONTNEED freed pages
out of the memfile before the snapshot.

The balloon is installed with FPH=true whenever FPR is on; both behaviors
are off by default and gated by separate LD flags (free-page-hinting,
reclaim-on-pause), so they can be flipped at runtime without rebuilding
templates.
@ValentaTomas
Member Author

Superseded by the splits — closing.

@ValentaTomas ValentaTomas deleted the feat/uffd-fc-free-page-hinting-and-reclaim branch May 4, 2026 05:25