Skip to content

refactor(uffd-tests): replace SIGUSR/env-var/pipe IPC with net/rpc/jsonrpc over unix socket#2519

Merged
ValentaTomas merged 23 commits intomainfrom
refactor/uffd-test-harness
May 2, 2026
Merged

refactor(uffd-tests): replace SIGUSR/env-var/pipe IPC with net/rpc/jsonrpc over unix socket#2519
ValentaTomas merged 23 commits intomainfrom
refactor/uffd-test-harness

Conversation

@ValentaTomas
Copy link
Copy Markdown
Member

@ValentaTomas ValentaTomas commented Apr 29, 2026

Replace the cross-process userfaultfd test harness's pipes + signals (SIGUSR1 shutdown, SIGUSR2 page-state snapshot, ready/offsets/gate-cmd/gate-sync pipes) with one Unix socket carrying stdlib net/rpc + net/rpc/jsonrpc. The userfaultfd and the rpc socketpair half are passed via ExtraFiles.

Production change: one atomic.Pointer[func(uintptr, faultPhase)] field on Userfaultfd and three nil-checked inline call sites. Test builds install the hook via SetTestFaultHook defined in a _test.go file.

Stacked follow-ups:

Pre-work for the upcoming REMOVE event handling PR. Lands as a runtime
no-op on main: no production code changes, no new UFFD features enabled.

Test-helper additions (all *_test.go):

- testConfig.gated and per-op `async bool`
- operationModeRemove / ServePause / ServeResume / Sleep
- testHandler.servePause / serveResume hooks driven by control pipes
  (fd 7/8) into the helper subprocess
- helper subprocess pause/resume goroutine that tears down and restarts
  the UFFD serve loop on demand
- executeAll() async path that fans out goroutines and joins on test
  context
- expectRemoved + checkDirtiness handling for read/write/remove ordering
- executeRemove via unix.MADV_DONTNEED
- unregister() helper around UFFDIO_UNREGISTER for cleanup
- pageStateEntries() now drains settleRequests then takes the
  pageTracker RLock directly so the snapshot stays correct for future
  writers (e.g. REMOVE) that don't go through settleRequests

operationModeRemove and the gated/async paths are exercised only by the
follow-up PR; here they are dead code at runtime, kept to minimise the
diff of that PR.
Codex review caught a deadlock: in GO_GATED mode, handling 'P' replaced
the stop closure with one that drained `exitUffd`, but the deferred
final cleanup still pointed at the original closure. If the helper exited
between 'P' and 'R' (e.g. parent test failure, ctx cancel, SIGUSR1), the
defer ran the original cleanup again and blocked forever on `<-exitUffd`
— the channel had already been drained and the original Serve goroutine
had already exited. The hung helper would in turn hang the parent's
cmd.Wait().

Replace the two-variable (cleanup / stopServe) layout with a single
mutex-guarded `stopFn` that is reset to a no-op after firing, and have
both the gated 'P' handler and the deferred final cleanup go through
the same `stopServe` wrapper. 'R' now just installs a fresh `stopFn`
that drains the new goroutine.
Pull out everything that's not strictly needed for "gated pause/resume +
async ops" so PR T is the smallest possible scaffolding diff and the
follow-up REMOVE PR carries those pieces alongside the production code
that actually needs them:

- helpers_test.go: drop operationModeRemove, executeRemove, expectRemoved
  and the REMOVE-aware checkDirtiness branch — none of the PR T tests
  call them, and they only become meaningful once UFFD_FEATURE_EVENT_REMOVE
  is set in the follow-up.
- fd_helpers_test.go: revert; drop unregister() helper. Without
  UFFD_FEATURE_EVENT_REMOVE, MADV_DONTNEED never queues UFFD events, so
  munmap doesn't block on un-acked events and there's nothing to
  unregister around.
- cross_process_helpers_test.go:
  - drop unregister(uffdFd, ...) call in late cleanup (same reason as
    above).
  - revert pageStateEntries() to the simple settleRequests.Lock pattern;
    the rework was forward-looking for a REMOVE handler that doesn't go
    through settleRequests and is dead code on main.
  - keep cleanup → stopFn/stopServe restructure (mandatory for the
    gated tear-down/respawn path) but restore the original
    fdExit.SignalExit error handling that the previous version had
    silently dropped.
  - revert incidental cosmetic reformats (exitUffd placement, defer
    multi-line) so the diff is purely additive in the unchanged paths.
…onrpc over unix socket

Replace the parent<->child cross-process userfaultfd test harness's
pile of single-purpose pipes (offsets, ready, gate-cmd, gate-sync) plus
a SIGUSR1 shutdown signal and a SIGUSR2 page-state-snapshot trigger
with one bidirectional Unix domain socket carrying stdlib net/rpc +
net/rpc/jsonrpc. Per-call request IDs are handled by the standard
library, so the parent can issue concurrent RPCs while a handler is
parked (which the upcoming fault-barrier methods need). The only fd
we still hand off out-of-band is the userfaultfd itself (kernel
object, has to go through ExtraFiles); the initial source data is
written to a temp file under t.TempDir() because base64-stuffing
megabytes through JSON would be silly.

Replaced surface:
  - 4 pipes + 2 signals  ->  1 socket + 1 RPC channel
  - SIGUSR1 graceful exit                      ->  Service.Shutdown
  - SIGUSR2 + offsets pipe + binary.Write      ->  Service.PageStates
  - ready pipe + ReadAll-on-close handshake    ->  Service.WaitReady
  - gate-cmd / gate-sync byte protocol         ->  Service.ServePause / Service.ServeResume

Added on top of the same channel (used only by stacked race-tests PR):
  - Service.InstallFaultBarrier(addr, point) -> token: arms a deterministic
    barrier in the child's worker goroutine at one of two test-only hook
    points (beforeWorkerRLockHook, beforeFaultPageHook).
  - Service.WaitFaultHeld(token): blocks until the worker reaches the
    barrier - the RPC reply IS the rendezvous.
  - Service.ReleaseFault(token): lets the parked worker proceed.

Production code change is intentionally minimal: two nil-by-default
hook fields on Userfaultfd plus three nil-checked calls. Hooks are
only assigned in the child's crossProcessServe wiring, so production
callers see a single un-predictable load + branch per fault.
@cursor
Copy link
Copy Markdown

cursor Bot commented Apr 29, 2026

PR Summary

Medium Risk
Mostly refactors test-only cross-process coordination, but it adds new hook call sites inside Userfaultfd.Serve/faultPage, which are performance- and correctness-sensitive paths even though guarded by nil checks.

Overview
Refactors the cross-process userfaultfd test harness to use a single Unix socket carrying net/rpc/jsonrpc for bootstrap, pause/resume, and page-state queries, replacing the prior signal/env-var/pipe-based protocol. It also introduces an optional test-only per-fault hook and barrier registry to deterministically park/release fault workers for race-oriented tests, while updating existing tests to drive the helper process via the new RPC client and APIs.

Reviewed by Cursor Bugbot for commit fa7d2ed. Bugbot is set up for automated code reviews on this repo. Configure here.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c39d391711

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread packages/orchestrator/pkg/sandbox/uffd/userfaultfd/cross_process_helpers_test.go Outdated
Comment thread packages/orchestrator/pkg/sandbox/uffd/userfaultfd/cross_process_helpers_test.go Outdated
Comment thread packages/orchestrator/pkg/sandbox/uffd/userfaultfd/cross_process_helpers_test.go Outdated
Comment thread packages/orchestrator/pkg/sandbox/uffd/userfaultfd/cross_process_helpers_test.go Outdated
Comment thread packages/orchestrator/pkg/sandbox/uffd/userfaultfd/cross_process_helpers_test.go Outdated
Comment thread packages/orchestrator/pkg/sandbox/uffd/userfaultfd/cross_process_helpers_test.go Outdated
Comment thread packages/orchestrator/pkg/sandbox/uffd/userfaultfd/helpers_test.go Outdated
Comment thread packages/orchestrator/pkg/sandbox/uffd/userfaultfd/cross_process_helpers_test.go Outdated
Comment thread packages/orchestrator/pkg/sandbox/uffd/userfaultfd/cross_process_helpers_test.go Outdated
Comment thread packages/orchestrator/pkg/sandbox/uffd/userfaultfd/cross_process_helpers_test.go Outdated
@ValentaTomas ValentaTomas marked this pull request as draft May 1, 2026 18:04
Base automatically changed from feat/uffd-test-scaffolding to main May 1, 2026 21:44
Resolve conflicts in cross_process_helpers_test.go and helpers_test.go
by taking the RPC-based test harness from this branch and folding in the
post-#2475 fixes from main:

- helpers_test.go: keep the RPC client/conn/cmd fields plus mutex as
  sync.RWMutex (so executeRead can RLock and executeWrite can Lock,
  matching the just-landed race fix) and the nil-guards in
  executeOperation for ServePause/ServeResume on non-gated handlers.
- cross_process_helpers_test.go: keep the net/rpc + jsonrpc harness,
  add the Service.startServe idempotency guard (early-return when a
  serve goroutine is already running) so a stray duplicate ServeResume
  RPC can't leak an untracked Serve goroutine, mirroring the running
  flag fix on main. Also add the unregister(uffdFd, ...) cleanup at
  the end of the cmd-wait t.Cleanup so a future REMOVE-events test
  doesn't hang in munmap on un-acked events.
- fd_helpers_test.go: auto-merged to keep the unregister helper from
  main (the RPC harness now uses it).
Replace the two test-only func fields beforeWorkerRLockHook and
beforeFaultPageHook on Userfaultfd with a single atomic.Pointer[testHooks]
holding a struct of optional callbacks. Lets PR #2521's race tests add a
new hook without touching the production struct, and shrinks the
test-surface comment from 17 lines to 3.

No production behavior change; hot-path cost is identical (one atomic
load + nil check vs one nil-check per hook today).
…to RPC

Reduce parent ↔ child side channels from {7 env config vars, 1 helper
flag, 1 uffd ExtraFile, 1 content tmp file, 1 RPC socket} to
{2 env vars, 1 uffd ExtraFile, 1 RPC socket}. Everything that was
configured via env or the content file now flows through a single
Lifecycle.Bootstrap RPC.

Drops:
- 7 config env vars (GO_UFFD_SOCKET kept; GO_UFFD_CONTENT, GO_UFFD_MMAP_*,
  GO_UFFD_ALWAYS_WP, GO_UFFD_GATED, GO_UFFD_BARRIERS removed)
- the content.bin tmp file (passed as []byte in BootstrapArgs)
- the testConfig.gated flag and its conditional RPC wiring
- childForkMu — replaced with a single F_DUPFD_CLOEXEC fcntl that
  removes the dup-then-set-cloexec race window the mutex was guarding

Splits the 752-line cross_process_helpers_test.go into seven cohesive
files: harness_{parent,child,client}_test.go for the wire/IPC plumbing
and rpc_{lifecycle,paging,barriers}_test.go for the three RPC services.
The Service god-struct becomes three small services multi-registered
with net/rpc; the parent calls Lifecycle.Bootstrap, Paging.Pause,
Barriers.Install, etc. via a typed *harnessClient wrapper.

testHandler now holds a single *harnessClient instead of
{client, conn, cmd} plus servePause/serveResume/pageStatesOnce
function fields; pageStatesOnce becomes a regular method that
delegates to the client.

Test-only; no production code changes. go vet / golangci-lint /
\`sudo go test -race\` all clean (×2).
Replace net.Listen/Accept/Dial over a tmp Unix socket with a
socketpair(2) fd pair handed to the child via cmd.ExtraFiles.

Removes:
- envSocketPath ("GO_UFFD_SOCKET") — only envHelperFlag remains
- net.ListenConfig.Listen on a t.TempDir() path
- listener.Accept with its 10s timeout select race
- the temp socket path inode under t.TempDir()

Side channels are now {GO_TEST_HELPER_PROCESS env var, uffd fd at fd 3,
rpc socket fd at fd 4}.
The previous Once-suffix was kept temporarily for call-site stability;
the method has always been callable any number of times. Rename the
method and update the eight call sites.
…to internal/rpcharness

Move the parts of the test harness that don't touch *Userfaultfd
internals into a new internal subpackage:

  packages/.../uffd/userfaultfd/internal/rpcharness/
      wire.go     — Bootstrap/PageStates/FaultBarrier/Token args+replies
      client.go   — typed Client wrapping *rpc.Client (Bootstrap, Pause,
                    Resume, PageStates, InstallBarrier, WaitFaultHeld,
                    ReleaseFault, Shutdown, Close)
      barriers.go — Registry + Point + HookFor

The RPC service implementations stay as _test.go in userfaultfd
(they need access to unexported pageStateEntries, settleRequests,
pageTracker, defaultCopyMode etc.). They become smaller — wire types
no longer locally defined; barrier registry consumed via interface
boundary; testHandler holds a single *rpcharness.Client instead of
{*rpc.Client, servePause, serveResume, pageStates} func fields.

internal/ scoping ensures the testutil package can never leak into a
production import path.

Test-only.
* Production userfaultfd.go drops the testHooks struct, the two
  callBefore* methods, and their multi-line test-explaining comments.
  What remains is a single atomic.Pointer[func(uintptr, faultPhase)]
  field plus inline call sites — the floor without build tags.
* rpcharness.Registry.HookFor(point) collapses into Hook(): one
  closure dispatches by (addr, point) instead of binding the point
  in the closure.
* Bootstrap wires a single SetTestFaultHook with an explicit
  faultPhase → rpcharness.Point switch, since the two enums use
  different numeric values intentionally.
…de, rename NewUserfaultfdFromFd → NewFromFd

Pure cleanup with two intentional behavior-equivalent tweaks:

- Flatten Paging.Resume to call state.startServe(), the locked
  wrapper around startServeLocked, mirroring the Pause/stopServe
  pattern.
- Align rpcharness.Point iota with userfaultfd.faultPhase iota
  (both now 0/1) so the test fault hook can pass values through
  with a numeric cast instead of a switch.

The renamed constructor keeps "wraps an existing fd" intent visible
at the call site: every caller today wraps an fd received over a
socket or via ExtraFiles; nobody asks the constructor to also run
the userfaultfd(2) syscall and register the region.
Keep the public symbol unchanged across all call sites so the PR
review focuses on the substantive harness simplification rather
than a name change that doesn't buy anything load-bearing.
…rap failure

Address two Cursor Bugbot findings on #2519:

* harness_parent_test.go: register the helper-process cleanup
  immediately after rpcharness.NewClient(parentConn) so a failure
  in client.Bootstrap or client.WaitReady still reaps the child
  and closes the rpc socket.
* rpc_services_test.go: startServeLocked now returns error;
  Lifecycle.Bootstrap and Paging.Resume propagate it instead of
  swallowing fdexit.New failures to stderr while telling the
  parent the serve loop is up.
Comment thread packages/orchestrator/pkg/sandbox/uffd/userfaultfd/rpc_services_test.go Outdated
@ValentaTomas ValentaTomas marked this pull request as ready for review May 2, 2026 06:22
… read

Mirror PrefetchData's drain-and-hold pattern. The previous Lock+immediate-
Unlock left a window where new workers could enter faultPage and mutate
pageTracker between the unlock and the pageTracker.mu.RLock acquisition,
breaking the "settled snapshot" intent the comment claimed.

Resolves Cursor Bugbot finding on rpc_services_test.go.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ee0186b5b6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread packages/orchestrator/pkg/sandbox/uffd/userfaultfd/rpc_services_test.go Outdated
Sibling of the existing testutils helpers; rename + move in one step.
The path now makes the test-only origin obvious. Drops internal/
scoping but no production caller exists or would want this package.
Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional findings (outside current diff — PR may have been updated during review):

  • 🔴 packages/orchestrator/pkg/sandbox/uffd/userfaultfd/internal/rpcharness/barriers.go:103-123Registry.Release(token) at internal/rpcharness/barriers.go:110 unconditionally deletes byKey[{s.addr, s.point}] when the token's slot exists, but Install (line 68) overwrites that key on every Install for the same (addr, point), so by the time a stale token is released the key may map to a newer barrier B. After Release(A), byKey[K] is gone, so when a worker faults at (addr, point) Hook() calls lookupByAddr → nil and never parks at slotB.arrived, leaving any later WaitArrived(B) to block forever (with WaitHeld calling WaitArrived(context.Background(), …) from rpc_services_test.go, "forever" is literal).

The fix is the standard compare-and-delete: only delete byKey[{s.addr, s.point}] when r.byKey[{s.addr, s.point}] == token. The same defect was previously flagged P2 by chatgpt-codex on the now-deleted cross_process_helpers_test.go and migrated unfixed into barriers.go.

Extended reasoning...

Bug

`Registry.Release` deletes the `byKey` reverse-index entry for a token whose entry may have been replaced by a later `Install`. The mapping in `byKey` is single-valued per `(addr, point)` — `Install` (line 67-68) unconditionally overwrites it — so re-installing the same key supersedes the prior token's reverse-index entry. `Release` does not check that `byKey[K]` still maps to the released token before deleting, so it happily wipes the entry that points to a different live token.

Concrete trace

  1. `Install(addr, point)` → `token A=1`. State: `tokens={1:slotA}`, `byKey[K]=1` where `K={addr, point}`.
  2. `Install(addr, point)` again → `token B=2`. `Install` overwrites: `byKey[K]=2`. State: `tokens={1:slotA, 2:slotB}`, `byKey[K]=2`.
  3. `Release(1)` runs:
    • line 107: `s = tokens[1] = slotA`
    • line 108: `delete(tokens, 1)` — fine.
    • line 110: `delete(byKey, key{slotA.addr, slotA.point})` = `delete(byKey, K)`. But `byKey[K]` is `2`, not `1`. The reverse-index entry pointing at the still-live B is gone.
    • line 121: closes `slotA.release` — fine, slotA is being torn down.
  4. State after Release: `tokens={2:slotB}`, `byKey={}`. Token B exists but is unreachable from the address-side index.
  5. A worker fault arrives at `(addr, point)`. The hook at `barriers.go:142-155` calls `lookupByAddr(addr, point)` → `byKey` lookup misses → returns `nil` → hook returns immediately without parking.
  6. `slotB.arrived` is never closed.
  7. Parent calls `WaitArrived(B)`. The select at line 95 blocks on `s.arrived` indefinitely. Because `Barriers.WaitHeld` in `rpc_services_test.go` passes `context.Background()`, the ctx branch never fires either.
  8. The test eventually hits `-timeout` and the whole binary is killed.

Why existing code does not prevent this

  • `Install` does not check whether `byKey[K]` is already populated, so the supersede case is silent rather than rejected.
  • `Release` does not consult `byKey[K]` before deleting — the key chosen for deletion is `{s.addr, s.point}` from the released slot, not from the current map state.
  • `WaitArrived` looks up the slot directly via `tokens[token]` so it does see slotB, but it has no way to know slotB is unreachable from the hook side. It just sits on the unbuffered channel.
  • `Hook` uses `lookupByAddr` (the only addr-keyed accessor), so once `byKey` is wrong every fault at that address misses the parked slot.

Impact

The barrier API is brand-new scaffolding in this PR; the immediate `*_test.go` files don't exercise an `Install A → Install B → Release A` sequence. But the PR description explicitly stages this surface for the next stacked PR's race tests ("The barrier methods are wired up but the test-side wrappers around them ship with the race-tests PR — kept here only so PR 3 lands as pure additions"). The race-tests PR will be exactly the population that does non-trivial `Install`/`Release` ordering, and a manifestation here would be an opaque test hang — `WaitArrived` parks on a channel with no diagnostic, the 5-minute test timeout kills the whole binary, and CI logs show only `panic: test timed out`. That is the worst possible failure mode for race-test scaffolding because every race test presents the same symptom regardless of which one tripped this. Right time to fix it is in the PR that introduces the surface, while it is one line.

Fix

Standard compare-and-delete:

```go
func (r *Registry) Release(token uint64) {
r.mu.Lock()
s, ok := r.tokens[token]
delete(r.tokens, token)
if ok {
k := key{s.addr, s.point}
if r.byKey[k] == token {
delete(r.byKey, k)
}
}
r.mu.Unlock()
// ...
}
```

Only delete the reverse-index entry when it still maps to the released token. Mirrors the standard pattern when keys can be re-used.

Prior-art note

`chatgpt-codex` flagged this same bug at P2 on the deleted `cross_process_helpers_test.go` (see "Preserve byKey entry when releasing superseded barrier" in the PR timeline). The bug migrated unfixed into `barriers.go` during the rpcharness extraction.

🔬 also observed by chatgpt-codex-connector

Comment thread packages/orchestrator/pkg/sandbox/uffd/userfaultfd/rpc_services_test.go Outdated
Address two ChatGPT-Codex-Connector findings on commit ee0186b:

* harness_parent_test.go: register helper cleanup immediately after
  cmd.Start succeeds. The closure picks the graceful (client.Shutdown)
  or hard (cmd.Process.Kill) path depending on whether the rpc client
  has been constructed yet, so a net.FileConn failure no longer leaks
  the child or the uffd registration.
* rpc_services_test.go: only create the barrier Registry when
  args.Barriers is true. With Barriers=false, Barriers.Install used
  to succeed and Barriers.WaitHeld then blocked forever; now it
  errors fast via the existing nil check, with a clearer message.
…ensive locking, Release token-reuse

Address four review findings on PR #2519, all real bugs introduced by
the harness rewrite:

* harness_parent_test.go: register the kernel uffd close, the
  unregister, and the child-reap cleanups at their points of
  acquisition. Failures in configureApi/register/FcntlInt/Socketpair/
  cmd.Start no longer leak fds or zombie processes.
* rpc_services_test.go: replace the harnessState.shutdown chan
  with a shutdownCtx; Barriers.WaitHeld now derives its ctx from
  it, so a misconfigured install or a child crash during a fault no
  longer parks the rpc handler forever (and SIGKILLs the test
  binary via Go's -timeout reaper).
* rpc_services_test.go: re-add pageTracker.mu.RLock under the
  settleRequests.Lock in pageStateEntries. settleRequests alone is
  sufficient today but PR 2's REMOVE handler will write
  pageTracker.m without going through settleRequests; the inner
  RLock makes the snapshot defensible against that change.
* testutils/testharness/barriers.go: Registry.Release now
  compare-and-deletes byKey, so an earlier token's release no
  longer clobbers a later Install for the same (addr, point) and
  the later WaitArrived no longer blocks forever.
Comment thread packages/orchestrator/pkg/sandbox/uffd/userfaultfd/harness_child_test.go Outdated
…_000_000

Reverts the 1M -> 10K reduction made in #2461. Intermittent failures
under 1M are bugs to be hunted, not silenced by lowering the load.
The other parallel/serial variants in these files have legitimately
been at 10K since their introduction in November 2025; only the two
TestParallelMissing(/Write) cases were reduced.
…MissingWriteWithPrefault loads to 1_000_000

Continues the regression sweep started in 4681ab2. The original
PR #1415 (88c3960, Nov 3 2025) introduced these tests at 1_000_000
operations. PR #1450 (6ee2ebb, Nov 7 2025, "Modify UFFD tests to
run serve loop in a separate process") cut them all to 10_000 in the
same commit that switched from in-process to cross-process tests, with
no commit message justification. The cross-process refactor itself
does not require lower iteration counts: the per-iteration work is
still in-process memory access; only the page-state snapshot RPC is
cross-process and is called once per test.

Reverts:
  - TestSerialMissing:                10_000 -> 1_000_000
  - TestSerialMissingWrite:           10_000 -> 1_000_000
  - TestParallelMissingWriteWithPrefault: 10_000 -> 1_000_000

Per-test wall time on this branch (sudo go test, no -race):
  - TestSerialMissing                  ~3s
  - TestSerialMissingWrite             ~3s
  - TestParallelMissingWriteWithPrefault ~2s

TestParallelMissingWithPrefault stays at 10_000: it was only ever
parallelOperations := 10 in #1415, then bumped to 10_000 in #1450,
so 10_000 is already above its historical baseline.
Address review findings from PR #2519:

* rpc_services_test.go: stopServe now snapshots and clears s.stop
  under the lock, then drops the lock before calling stop() (which
  blocks on <-done). Holding s.mu across the drain blocked every
  concurrent RPC handler that takes s.mu, and would deadlock the
  upcoming race tests where WaitFaultHeld needs s.mu while a worker
  is parked at a barrier waiting for a paired Pause to drain.

* harness_child_test.go: guard the child's rpc conn close with a
  sync.Once so the deferred safety-net close (needed for early
  returns from rpc.Register) does not double-close after the
  explicit close that unblocks server.ServeCodec on the success path.

Audit of the rest of rpc_services_test.go found no other
lock-during-blocking patterns: releaseAllBarriers, Paging.States,
and Barriers.{Install,WaitHeld,Release} already snapshot under the
lock and call out unlocked; Lifecycle.{Bootstrap,Shutdown,Resume}
only do non-blocking work under the lock (channel send via
context.cancel, fdexit.New, goroutine spawn).
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 7683f88. Configure here.

Comment thread packages/orchestrator/pkg/sandbox/uffd/userfaultfd/missing_test.go
…lpers

Pure cleanup; no behavior change.

- trim file-header and method docs across testharness/ + harness_*_test.go
- inline releaseAllBarriers (single caller)
- preserve drain-fence and REMOVE-defensive WHYs on pageStateEntries
@ValentaTomas ValentaTomas merged commit 1af207c into main May 2, 2026
45 checks passed
@ValentaTomas ValentaTomas deleted the refactor/uffd-test-harness branch May 2, 2026 19:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants