refactor(uffd-tests): replace SIGUSR/env-var/pipe IPC with net/rpc/jsonrpc over unix socket#2519
Conversation
Pre-work for the upcoming REMOVE event handling PR. Lands as a runtime no-op on main: no production code changes, no new UFFD features enabled. Test-helper additions (all *_test.go): - testConfig.gated and per-op `async bool` - operationModeRemove / ServePause / ServeResume / Sleep - testHandler.servePause / serveResume hooks driven by control pipes (fd 7/8) into the helper subprocess - helper subprocess pause/resume goroutine that tears down and restarts the UFFD serve loop on demand - executeAll() async path that fans out goroutines and joins on test context - expectRemoved + checkDirtiness handling for read/write/remove ordering - executeRemove via unix.MADV_DONTNEED - unregister() helper around UFFDIO_UNREGISTER for cleanup - pageStateEntries() now drains settleRequests then takes the pageTracker RLock directly so the snapshot stays correct for future writers (e.g. REMOVE) that don't go through settleRequests operationModeRemove and the gated/async paths are exercised only by the follow-up PR; here they are dead code at runtime, kept to minimise the diff of that PR.
Codex review caught a deadlock: in GO_GATED mode, handling 'P' replaced the stop closure with one that drained `exitUffd`, but the deferred final cleanup still pointed at the original closure. If the helper exited between 'P' and 'R' (e.g. parent test failure, ctx cancel, SIGUSR1), the defer ran the original cleanup again and blocked forever on `<-exitUffd` — the channel had already been drained and the original Serve goroutine had already exited. The hung helper would in turn hang the parent's cmd.Wait(). Replace the two-variable (cleanup / stopServe) layout with a single mutex-guarded `stopFn` that is reset to a no-op after firing, and have both the gated 'P' handler and the deferred final cleanup go through the same `stopServe` wrapper. 'R' now just installs a fresh `stopFn` that drains the new goroutine.
Pull out everything that's not strictly needed for "gated pause/resume +
async ops" so PR T is the smallest possible scaffolding diff and the
follow-up REMOVE PR carries those pieces alongside the production code
that actually needs them:
- helpers_test.go: drop operationModeRemove, executeRemove, expectRemoved
and the REMOVE-aware checkDirtiness branch — none of the PR T tests
call them, and they only become meaningful once UFFD_FEATURE_EVENT_REMOVE
is set in the follow-up.
- fd_helpers_test.go: revert; drop unregister() helper. Without
UFFD_FEATURE_EVENT_REMOVE, MADV_DONTNEED never queues UFFD events, so
munmap doesn't block on un-acked events and there's nothing to
unregister around.
- cross_process_helpers_test.go:
- drop unregister(uffdFd, ...) call in late cleanup (same reason as
above).
- revert pageStateEntries() to the simple settleRequests.Lock pattern;
the rework was forward-looking for a REMOVE handler that doesn't go
through settleRequests and is dead code on main.
- keep cleanup → stopFn/stopServe restructure (mandatory for the
gated tear-down/respawn path) but restore the original
fdExit.SignalExit error handling that the previous version had
silently dropped.
- revert incidental cosmetic reformats (exitUffd placement, defer
multi-line) so the diff is purely additive in the unchanged paths.
…onrpc over unix socket
Replace the parent<->child cross-process userfaultfd test harness's
pile of single-purpose pipes (offsets, ready, gate-cmd, gate-sync) plus
a SIGUSR1 shutdown signal and a SIGUSR2 page-state-snapshot trigger
with one bidirectional Unix domain socket carrying stdlib net/rpc +
net/rpc/jsonrpc. Per-call request IDs are handled by the standard
library, so the parent can issue concurrent RPCs while a handler is
parked (which the upcoming fault-barrier methods need). The only fd
we still hand off out-of-band is the userfaultfd itself (kernel
object, has to go through ExtraFiles); the initial source data is
written to a temp file under t.TempDir() because base64-stuffing
megabytes through JSON would be silly.
Replaced surface:
- 4 pipes + 2 signals -> 1 socket + 1 RPC channel
- SIGUSR1 graceful exit -> Service.Shutdown
- SIGUSR2 + offsets pipe + binary.Write -> Service.PageStates
- ready pipe + ReadAll-on-close handshake -> Service.WaitReady
- gate-cmd / gate-sync byte protocol -> Service.ServePause / Service.ServeResume
Added on top of the same channel (used only by stacked race-tests PR):
- Service.InstallFaultBarrier(addr, point) -> token: arms a deterministic
barrier in the child's worker goroutine at one of two test-only hook
points (beforeWorkerRLockHook, beforeFaultPageHook).
- Service.WaitFaultHeld(token): blocks until the worker reaches the
barrier - the RPC reply IS the rendezvous.
- Service.ReleaseFault(token): lets the parked worker proceed.
Production code change is intentionally minimal: two nil-by-default
hook fields on Userfaultfd plus three nil-checked calls. Hooks are
only assigned in the child's crossProcessServe wiring, so production
callers see a single un-predictable load + branch per fault.
PR SummaryMedium Risk Overview Reviewed by Cursor Bugbot for commit fa7d2ed. Bugbot is set up for automated code reviews on this repo. Configure here. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c39d391711
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Resolve conflicts in cross_process_helpers_test.go and helpers_test.go by taking the RPC-based test harness from this branch and folding in the post-#2475 fixes from main: - helpers_test.go: keep the RPC client/conn/cmd fields plus mutex as sync.RWMutex (so executeRead can RLock and executeWrite can Lock, matching the just-landed race fix) and the nil-guards in executeOperation for ServePause/ServeResume on non-gated handlers. - cross_process_helpers_test.go: keep the net/rpc + jsonrpc harness, add the Service.startServe idempotency guard (early-return when a serve goroutine is already running) so a stray duplicate ServeResume RPC can't leak an untracked Serve goroutine, mirroring the running flag fix on main. Also add the unregister(uffdFd, ...) cleanup at the end of the cmd-wait t.Cleanup so a future REMOVE-events test doesn't hang in munmap on un-acked events. - fd_helpers_test.go: auto-merged to keep the unregister helper from main (the RPC harness now uses it).
Replace the two test-only func fields beforeWorkerRLockHook and beforeFaultPageHook on Userfaultfd with a single atomic.Pointer[testHooks] holding a struct of optional callbacks. Lets PR #2521's race tests add a new hook without touching the production struct, and shrinks the test-surface comment from 17 lines to 3. No production behavior change; hot-path cost is identical (one atomic load + nil check vs one nil-check per hook today).
…to RPC
Reduce parent ↔ child side channels from {7 env config vars, 1 helper
flag, 1 uffd ExtraFile, 1 content tmp file, 1 RPC socket} to
{2 env vars, 1 uffd ExtraFile, 1 RPC socket}. Everything that was
configured via env or the content file now flows through a single
Lifecycle.Bootstrap RPC.
Drops:
- 7 config env vars (GO_UFFD_SOCKET kept; GO_UFFD_CONTENT, GO_UFFD_MMAP_*,
GO_UFFD_ALWAYS_WP, GO_UFFD_GATED, GO_UFFD_BARRIERS removed)
- the content.bin tmp file (passed as []byte in BootstrapArgs)
- the testConfig.gated flag and its conditional RPC wiring
- childForkMu — replaced with a single F_DUPFD_CLOEXEC fcntl that
removes the dup-then-set-cloexec race window the mutex was guarding
Splits the 752-line cross_process_helpers_test.go into seven cohesive
files: harness_{parent,child,client}_test.go for the wire/IPC plumbing
and rpc_{lifecycle,paging,barriers}_test.go for the three RPC services.
The Service god-struct becomes three small services multi-registered
with net/rpc; the parent calls Lifecycle.Bootstrap, Paging.Pause,
Barriers.Install, etc. via a typed *harnessClient wrapper.
testHandler now holds a single *harnessClient instead of
{client, conn, cmd} plus servePause/serveResume/pageStatesOnce
function fields; pageStatesOnce becomes a regular method that
delegates to the client.
Test-only; no production code changes. go vet / golangci-lint /
\`sudo go test -race\` all clean (×2).
Replace net.Listen/Accept/Dial over a tmp Unix socket with a
socketpair(2) fd pair handed to the child via cmd.ExtraFiles.
Removes:
- envSocketPath ("GO_UFFD_SOCKET") — only envHelperFlag remains
- net.ListenConfig.Listen on a t.TempDir() path
- listener.Accept with its 10s timeout select race
- the temp socket path inode under t.TempDir()
Side channels are now {GO_TEST_HELPER_PROCESS env var, uffd fd at fd 3,
rpc socket fd at fd 4}.
The previous Once-suffix was kept temporarily for call-site stability; the method has always been callable any number of times. Rename the method and update the eight call sites.
…to internal/rpcharness
Move the parts of the test harness that don't touch *Userfaultfd
internals into a new internal subpackage:
packages/.../uffd/userfaultfd/internal/rpcharness/
wire.go — Bootstrap/PageStates/FaultBarrier/Token args+replies
client.go — typed Client wrapping *rpc.Client (Bootstrap, Pause,
Resume, PageStates, InstallBarrier, WaitFaultHeld,
ReleaseFault, Shutdown, Close)
barriers.go — Registry + Point + HookFor
The RPC service implementations stay as _test.go in userfaultfd
(they need access to unexported pageStateEntries, settleRequests,
pageTracker, defaultCopyMode etc.). They become smaller — wire types
no longer locally defined; barrier registry consumed via interface
boundary; testHandler holds a single *rpcharness.Client instead of
{*rpc.Client, servePause, serveResume, pageStates} func fields.
internal/ scoping ensures the testutil package can never leak into a
production import path.
Test-only.
* Production userfaultfd.go drops the testHooks struct, the two callBefore* methods, and their multi-line test-explaining comments. What remains is a single atomic.Pointer[func(uintptr, faultPhase)] field plus inline call sites — the floor without build tags. * rpcharness.Registry.HookFor(point) collapses into Hook(): one closure dispatches by (addr, point) instead of binding the point in the closure. * Bootstrap wires a single SetTestFaultHook with an explicit faultPhase → rpcharness.Point switch, since the two enums use different numeric values intentionally.
…de, rename NewUserfaultfdFromFd → NewFromFd Pure cleanup with two intentional behavior-equivalent tweaks: - Flatten Paging.Resume to call state.startServe(), the locked wrapper around startServeLocked, mirroring the Pause/stopServe pattern. - Align rpcharness.Point iota with userfaultfd.faultPhase iota (both now 0/1) so the test fault hook can pass values through with a numeric cast instead of a switch. The renamed constructor keeps "wraps an existing fd" intent visible at the call site: every caller today wraps an fd received over a socket or via ExtraFiles; nobody asks the constructor to also run the userfaultfd(2) syscall and register the region.
Keep the public symbol unchanged across all call sites so the PR review focuses on the substantive harness simplification rather than a name change that doesn't buy anything load-bearing.
…rap failure Address two Cursor Bugbot findings on #2519: * harness_parent_test.go: register the helper-process cleanup immediately after rpcharness.NewClient(parentConn) so a failure in client.Bootstrap or client.WaitReady still reaps the child and closes the rpc socket. * rpc_services_test.go: startServeLocked now returns error; Lifecycle.Bootstrap and Paging.Resume propagate it instead of swallowing fdexit.New failures to stderr while telling the parent the serve loop is up.
… read Mirror PrefetchData's drain-and-hold pattern. The previous Lock+immediate- Unlock left a window where new workers could enter faultPage and mutate pageTracker between the unlock and the pageTracker.mu.RLock acquisition, breaking the "settled snapshot" intent the comment claimed. Resolves Cursor Bugbot finding on rpc_services_test.go.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ee0186b5b6
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Sibling of the existing testutils helpers; rename + move in one step. The path now makes the test-only origin obvious. Drops internal/ scoping but no production caller exists or would want this package.
There was a problem hiding this comment.
Additional findings (outside current diff — PR may have been updated during review):
- 🔴
packages/orchestrator/pkg/sandbox/uffd/userfaultfd/internal/rpcharness/barriers.go:103-123—Registry.Release(token)atinternal/rpcharness/barriers.go:110unconditionally deletesbyKey[{s.addr, s.point}]when the token's slot exists, butInstall(line 68) overwrites that key on every Install for the same(addr, point), so by the time a stale token is released the key may map to a newer barrier B. AfterRelease(A),byKey[K]is gone, so when a worker faults at(addr, point)Hook()callslookupByAddr→ nil and never parks atslotB.arrived, leaving any laterWaitArrived(B)to block forever (withWaitHeldcallingWaitArrived(context.Background(), …)fromrpc_services_test.go, "forever" is literal).
The fix is the standard compare-and-delete: only delete byKey[{s.addr, s.point}] when r.byKey[{s.addr, s.point}] == token. The same defect was previously flagged P2 by chatgpt-codex on the now-deleted cross_process_helpers_test.go and migrated unfixed into barriers.go.
Extended reasoning...
Bug
`Registry.Release` deletes the `byKey` reverse-index entry for a token whose entry may have been replaced by a later `Install`. The mapping in `byKey` is single-valued per `(addr, point)` — `Install` (line 67-68) unconditionally overwrites it — so re-installing the same key supersedes the prior token's reverse-index entry. `Release` does not check that `byKey[K]` still maps to the released token before deleting, so it happily wipes the entry that points to a different live token.
Concrete trace
- `Install(addr, point)` → `token A=1`. State: `tokens={1:slotA}`, `byKey[K]=1` where `K={addr, point}`.
- `Install(addr, point)` again → `token B=2`. `Install` overwrites: `byKey[K]=2`. State: `tokens={1:slotA, 2:slotB}`, `byKey[K]=2`.
- `Release(1)` runs:
- line 107: `s = tokens[1] = slotA`
- line 108: `delete(tokens, 1)` — fine.
- line 110: `delete(byKey, key{slotA.addr, slotA.point})` = `delete(byKey, K)`. But `byKey[K]` is `2`, not `1`. The reverse-index entry pointing at the still-live B is gone.
- line 121: closes `slotA.release` — fine, slotA is being torn down.
- State after Release: `tokens={2:slotB}`, `byKey={}`. Token B exists but is unreachable from the address-side index.
- A worker fault arrives at `(addr, point)`. The hook at `barriers.go:142-155` calls `lookupByAddr(addr, point)` → `byKey` lookup misses → returns `nil` → hook returns immediately without parking.
- `slotB.arrived` is never closed.
- Parent calls `WaitArrived(B)`. The select at line 95 blocks on `s.arrived` indefinitely. Because `Barriers.WaitHeld` in `rpc_services_test.go` passes `context.Background()`, the ctx branch never fires either.
- The test eventually hits `-timeout` and the whole binary is killed.
Why existing code does not prevent this
- `Install` does not check whether `byKey[K]` is already populated, so the supersede case is silent rather than rejected.
- `Release` does not consult `byKey[K]` before deleting — the key chosen for deletion is `{s.addr, s.point}` from the released slot, not from the current map state.
- `WaitArrived` looks up the slot directly via `tokens[token]` so it does see slotB, but it has no way to know slotB is unreachable from the hook side. It just sits on the unbuffered channel.
- `Hook` uses `lookupByAddr` (the only addr-keyed accessor), so once `byKey` is wrong every fault at that address misses the parked slot.
Impact
The barrier API is brand-new scaffolding in this PR; the immediate `*_test.go` files don't exercise an `Install A → Install B → Release A` sequence. But the PR description explicitly stages this surface for the next stacked PR's race tests ("The barrier methods are wired up but the test-side wrappers around them ship with the race-tests PR — kept here only so PR 3 lands as pure additions"). The race-tests PR will be exactly the population that does non-trivial `Install`/`Release` ordering, and a manifestation here would be an opaque test hang — `WaitArrived` parks on a channel with no diagnostic, the 5-minute test timeout kills the whole binary, and CI logs show only `panic: test timed out`. That is the worst possible failure mode for race-test scaffolding because every race test presents the same symptom regardless of which one tripped this. Right time to fix it is in the PR that introduces the surface, while it is one line.
Fix
Standard compare-and-delete:
```go
func (r *Registry) Release(token uint64) {
r.mu.Lock()
s, ok := r.tokens[token]
delete(r.tokens, token)
if ok {
k := key{s.addr, s.point}
if r.byKey[k] == token {
delete(r.byKey, k)
}
}
r.mu.Unlock()
// ...
}
```
Only delete the reverse-index entry when it still maps to the released token. Mirrors the standard pattern when keys can be re-used.
Prior-art note
`chatgpt-codex` flagged this same bug at P2 on the deleted `cross_process_helpers_test.go` (see "Preserve byKey entry when releasing superseded barrier" in the PR timeline). The bug migrated unfixed into `barriers.go` during the rpcharness extraction.
🔬 also observed by chatgpt-codex-connector
Address two ChatGPT-Codex-Connector findings on commit ee0186b: * harness_parent_test.go: register helper cleanup immediately after cmd.Start succeeds. The closure picks the graceful (client.Shutdown) or hard (cmd.Process.Kill) path depending on whether the rpc client has been constructed yet, so a net.FileConn failure no longer leaks the child or the uffd registration. * rpc_services_test.go: only create the barrier Registry when args.Barriers is true. With Barriers=false, Barriers.Install used to succeed and Barriers.WaitHeld then blocked forever; now it errors fast via the existing nil check, with a clearer message.
…ensive locking, Release token-reuse Address four review findings on PR #2519, all real bugs introduced by the harness rewrite: * harness_parent_test.go: register the kernel uffd close, the unregister, and the child-reap cleanups at their points of acquisition. Failures in configureApi/register/FcntlInt/Socketpair/ cmd.Start no longer leak fds or zombie processes. * rpc_services_test.go: replace the harnessState.shutdown chan with a shutdownCtx; Barriers.WaitHeld now derives its ctx from it, so a misconfigured install or a child crash during a fault no longer parks the rpc handler forever (and SIGKILLs the test binary via Go's -timeout reaper). * rpc_services_test.go: re-add pageTracker.mu.RLock under the settleRequests.Lock in pageStateEntries. settleRequests alone is sufficient today but PR 2's REMOVE handler will write pageTracker.m without going through settleRequests; the inner RLock makes the snapshot defensible against that change. * testutils/testharness/barriers.go: Registry.Release now compare-and-deletes byKey, so an earlier token's release no longer clobbers a later Install for the same (addr, point) and the later WaitArrived no longer blocks forever.
…_000_000 Reverts the 1M -> 10K reduction made in #2461. Intermittent failures under 1M are bugs to be hunted, not silenced by lowering the load. The other parallel/serial variants in these files have legitimately been at 10K since their introduction in November 2025; only the two TestParallelMissing(/Write) cases were reduced.
…MissingWriteWithPrefault loads to 1_000_000 Continues the regression sweep started in 4681ab2. The original PR #1415 (88c3960, Nov 3 2025) introduced these tests at 1_000_000 operations. PR #1450 (6ee2ebb, Nov 7 2025, "Modify UFFD tests to run serve loop in a separate process") cut them all to 10_000 in the same commit that switched from in-process to cross-process tests, with no commit message justification. The cross-process refactor itself does not require lower iteration counts: the per-iteration work is still in-process memory access; only the page-state snapshot RPC is cross-process and is called once per test. Reverts: - TestSerialMissing: 10_000 -> 1_000_000 - TestSerialMissingWrite: 10_000 -> 1_000_000 - TestParallelMissingWriteWithPrefault: 10_000 -> 1_000_000 Per-test wall time on this branch (sudo go test, no -race): - TestSerialMissing ~3s - TestSerialMissingWrite ~3s - TestParallelMissingWriteWithPrefault ~2s TestParallelMissingWithPrefault stays at 10_000: it was only ever parallelOperations := 10 in #1415, then bumped to 10_000 in #1450, so 10_000 is already above its historical baseline.
Address review findings from PR #2519: * rpc_services_test.go: stopServe now snapshots and clears s.stop under the lock, then drops the lock before calling stop() (which blocks on <-done). Holding s.mu across the drain blocked every concurrent RPC handler that takes s.mu, and would deadlock the upcoming race tests where WaitFaultHeld needs s.mu while a worker is parked at a barrier waiting for a paired Pause to drain. * harness_child_test.go: guard the child's rpc conn close with a sync.Once so the deferred safety-net close (needed for early returns from rpc.Register) does not double-close after the explicit close that unblocks server.ServeCodec on the success path. Audit of the rest of rpc_services_test.go found no other lock-during-blocking patterns: releaseAllBarriers, Paging.States, and Barriers.{Install,WaitHeld,Release} already snapshot under the lock and call out unlocked; Lifecycle.{Bootstrap,Shutdown,Resume} only do non-blocking work under the lock (channel send via context.cancel, fdexit.New, goroutine spawn).
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 7683f88. Configure here.
…lpers Pure cleanup; no behavior change. - trim file-header and method docs across testharness/ + harness_*_test.go - inline releaseAllBarriers (single caller) - preserve drain-fence and REMOVE-defensive WHYs on pageStateEntries

Replace the cross-process userfaultfd test harness's pipes + signals (
SIGUSR1shutdown,SIGUSR2page-state snapshot, ready/offsets/gate-cmd/gate-sync pipes) with one Unix socket carrying stdlibnet/rpc+net/rpc/jsonrpc. The userfaultfd and the rpc socketpair half are passed viaExtraFiles.Production change: one
atomic.Pointer[func(uintptr, faultPhase)]field onUserfaultfdand three nil-checked inline call sites. Test builds install the hook viaSetTestFaultHookdefined in a_test.gofile.Stacked follow-ups:
UFFD_EVENT_REMOVEhandling + matrix tests — feat(uffd): handle UFFD_EVENT_REMOVE; track per-page state; race-safe COPY #2520