test(uffd): add deterministic UFFD stale-source race tests#2521
ValentaTomas wants to merge 3 commits into feat/uffd-remove-events-matrix
PR Summary: Medium Risk. Reviewed by Cursor Bugbot for commit 4e7a838.
dobrac left a comment:
Let's resolve the comments please 🙏
Replace the two test-only func fields beforeWorkerRLockHook and beforeFaultPageHook on Userfaultfd with a single atomic.Pointer[testHooks] holding a struct of optional callbacks. Lets PR #2521's race tests add a new hook without touching the production struct, and shrinks the test-surface comment from 17 lines to 3. No production behavior change; hot-path cost is identical (one atomic load + nil check vs one nil-check per hook today).
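The single-pointer hook layout described above can be sketched as follows. This is a minimal illustration, not the code from the PR: the hook field names follow the commit message, but their signatures, the `hooks` field name, and the `handleFault` method are assumptions made for the example.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// testHooks groups optional test-only callbacks; a nil field means "no hook".
// Field names follow the commit message; the signatures are assumed.
type testHooks struct {
	beforeWorkerRLock func(addr uintptr)
	beforeFaultPage   func(addr uintptr)
}

// Userfaultfd is a stand-in that holds only the hook pointer.
type Userfaultfd struct {
	hooks atomic.Pointer[testHooks] // nil in production builds
}

// handleFault is a hypothetical hot path: one atomic load, then nil checks.
func (u *Userfaultfd) handleFault(addr uintptr) {
	h := u.hooks.Load()
	if h != nil && h.beforeWorkerRLock != nil {
		h.beforeWorkerRLock(addr)
	}
	// ... take the read lock and resolve the page here ...
	if h != nil && h.beforeFaultPage != nil {
		h.beforeFaultPage(addr)
	}
}

func main() {
	var u Userfaultfd
	u.handleFault(0x1000) // no hooks installed: both checks fall through

	var calls []string
	u.hooks.Store(&testHooks{
		beforeFaultPage: func(addr uintptr) {
			calls = append(calls, fmt.Sprintf("faultPage@%#x", addr))
		},
	})
	u.handleFault(0x2000)
	fmt.Println(calls)
}
```

Adding another callback in this layout means adding a field to `testHooks` only; the struct that carries the pointer stays unchanged.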
…ed-short-circuit race tests
Three race tests built on the unix-socket RPC harness and the test-only
fault-barrier hooks. None use sleeps, retries, or soak loops - each
test installs explicit barriers on the child's worker goroutine, drives
the racing kernel operation from the parent, and asserts on a concrete
post-state.
- TestStaleSourceRaceMissingAndRemove: regression test for the
stale-source bug. Plants a non-zero sentinel into the source page,
parks the worker via barrierBeforeRLock, fires madvise, waits for
the REMOVE batch to commit, releases the worker, then asserts the
page is zero-filled. INTENTIONALLY FAILS on this PR with
`page 1 first byte: want 0 ... got 0xc3` - the worker captured
`source = u.src` in the parent loop before the REMOVE landed and
UFFDIO_COPY'd the planted sentinel into the page after the kernel
had MADV_DONTNEED'd it. PR #4 (#2512) makes this pass by re-reading
state inside the worker under settleRequests.RLock.
- TestNoMadviseDeadlockWithInflightCopy: liveness regression test.
Parks the worker via barrierBeforeFaultPage (holding RLock), fires
madvise, asserts madvise returns within 2s. Passes today; protects
against any future change that accidentally couples readEvents to
settleRequests.
- TestFaultedShortCircuitOrdering: smoke test on the REMOVE-then-
pagefault batch ordering using the gated harness. Pins the
invariant that REMOVE batches drain before pagefault dispatch in
a single Serve iteration.
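The barrier pattern behind the stale-source test can be illustrated with a toy model that needs no userfaultfd at all. Everything below is hypothetical (the `page` type, `resolve`, and `race` are invented for this sketch); it only shows why a decision captured before the barrier goes stale, while a decision re-read under the read lock does not.

```go
package main

import (
	"fmt"
	"sync"
)

type pageState int

const (
	stateMissing pageState = iota
	stateRemoved
)

// page is a toy model: state is guarded by mu, src is the backing data byte.
type page struct {
	mu    sync.RWMutex
	state pageState
	src   byte
}

// resolve returns the byte the worker would install into the page.
// It closes held when parked at the barrier and waits on release.
// With rereadUnderLock false, the pre-barrier snapshot decides the outcome
// (the stale pattern); with true, state is read again under RLock.
func resolve(p *page, held chan<- struct{}, release <-chan struct{}, rereadUnderLock bool) byte {
	p.mu.RLock()
	captured := p.state
	p.mu.RUnlock()

	close(held) // barrier: tell the driver the worker is parked
	<-release   // wait until the racing operation has committed

	p.mu.RLock()
	defer p.mu.RUnlock()
	st := captured
	if rereadUnderLock {
		st = p.state
	}
	if st == stateRemoved {
		return 0 // zero-fill
	}
	return p.src
}

// race parks the worker, commits a remove, releases, and returns the result.
func race(rereadUnderLock bool) byte {
	p := &page{state: stateMissing, src: 0xc3}
	held, release := make(chan struct{}), make(chan struct{})
	out := make(chan byte, 1)
	go func() { out <- resolve(p, held, release, rereadUnderLock) }()

	<-held // worker is parked
	p.mu.Lock()
	p.state = stateRemoved // the racing "remove" commits
	p.mu.Unlock()
	close(release)
	return <-out
}

func main() {
	fmt.Printf("stale capture: %#x\n", race(false))
	fmt.Printf("re-read under lock: %#x\n", race(true))
}
```

Because the interleaving is forced by channels rather than timing, the stale variant yields the sentinel on every run and the re-read variant yields zero on every run.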
Test infrastructure additions:
- testHandler.installFaultBarrier / waitFaultHeld / releaseFault
convenience wrappers around the Service.* RPCs from PR #1.
- testConfig.sourcePatcher hook so race tests can plant a
deterministic sentinel into the random source data BEFORE the
content file is written, without depending on the happenstance
value of any randomly-generated byte.
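A patcher hook of this kind might look like the sketch below. The `sourcePatcher` signature, `buildSource`, and `plantSentinel` are assumptions for illustration; the PR text names the hook but does not show its definition.

```go
package main

import (
	"fmt"
	"math/rand"
)

// sourcePatcher mutates generated source data before it is persisted.
// The signature is an assumption made for this sketch.
type sourcePatcher func(data []byte, pageSize int)

// buildSource generates pseudo-random content and applies the optional patcher.
func buildSource(seed int64, pages, pageSize int, patch sourcePatcher) []byte {
	data := make([]byte, pages*pageSize)
	rand.New(rand.NewSource(seed)).Read(data)
	if patch != nil {
		patch(data, pageSize)
	}
	return data // a real harness would write this to the content file next
}

// plantSentinel returns a patcher that sets the first byte of one page.
func plantSentinel(pageIdx int, sentinel byte) sourcePatcher {
	return func(data []byte, pageSize int) {
		data[pageIdx*pageSize] = sentinel
	}
}

func main() {
	data := buildSource(1, 4, 4096, plantSentinel(1, 0xc3))
	fmt.Printf("page 1 first byte: %#x\n", data[4096])
}
```

Patching after generation but before the write keeps the assertion independent of whatever byte the generator happened to produce at that offset.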
ALL OTHER TESTS in the package still pass on this PR; only the three
sub-tests of TestStaleSourceRaceMissingAndRemove fail (the bug
demonstration).
- waitForState: add default case to avoid silent busy-poll on unrecognised pageState values.
- TestFaultedShortCircuitOrdering: rewrite docstring to accurately describe coverage (disjoint-page end-state check, not an ordering invariant guard; same-page ordering is covered by TestStaleSource...).
- TestStaleSourceRaceMissingAndRemove: fix "MISSING-write fault" stale docstring to "MISSING (READ) fault", note both variants fail until #2512.
- Trim verbose multi-line constant and helper comments down to load-bearing WHY.
…onrpc over unix socket (#2519)

Replace the cross-process userfaultfd test harness's pipes + signals (`SIGUSR1` shutdown, `SIGUSR2` page-state snapshot, ready/offsets/gate-cmd/gate-sync pipes) with one Unix socket carrying stdlib `net/rpc` + `net/rpc/jsonrpc`. The userfaultfd and the rpc socketpair half are passed via `ExtraFiles`.

Production change: one `atomic.Pointer[func(uintptr, faultPhase)]` field on `Userfaultfd` and three nil-checked inline call sites. Test builds install the hook via `SetTestFaultHook` defined in a `_test.go` file.

Stacked follow-ups:
- `UFFD_EVENT_REMOVE` handling + matrix tests (#2520)
- Stale-source / madvise-deadlock / faulted-short-circuit race tests (#2521)
- Stale-source race fix (#2512)
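A rough sketch of driving a harness over `net/rpc/jsonrpc` on a single connection is shown below. It uses `net.Pipe` as an in-process stand-in for the Unix socketpair passed via `ExtraFiles`, and the `PageState` method with its argument and reply types is hypothetical; the actual RPC surface of the harness is not shown in this PR text.

```go
package main

import (
	"fmt"
	"net"
	"net/rpc"
	"net/rpc/jsonrpc"
)

// Service is a hypothetical RPC receiver standing in for the child process.
type Service struct {
	states map[uint64]string
}

// PageArgs and PageReply are illustrative request/response types.
type PageArgs struct{ Offset uint64 }
type PageReply struct{ State string }

// PageState reports the recorded state for an offset, defaulting to "missing".
func (s *Service) PageState(args PageArgs, reply *PageReply) error {
	st, ok := s.states[args.Offset]
	if !ok {
		st = "missing"
	}
	reply.State = st
	return nil
}

// queryState serves one end of a connection and calls it from the other.
func queryState(offset uint64) (string, error) {
	parentConn, childConn := net.Pipe() // stand-in for the socketpair halves

	srv := rpc.NewServer()
	svc := &Service{states: map[uint64]string{4096: "faulted"}}
	if err := srv.Register(svc); err != nil {
		return "", err
	}
	go srv.ServeCodec(jsonrpc.NewServerCodec(childConn))

	client := jsonrpc.NewClient(parentConn)
	defer client.Close()

	var reply PageReply
	if err := client.Call("Service.PageState", PageArgs{Offset: offset}, &reply); err != nil {
		return "", err
	}
	return reply.State, nil
}

func main() {
	st, err := queryState(4096)
	fmt.Println(st, err)
}
```

A request/response call gives the parent a synchronous answer for each query, where the signal-and-pipe scheme it replaces needed separate channels for commands and snapshots.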
Summary
Add three deterministic race regression tests built on the RPC harness (#2519) and the REMOVE handling (#2520). Each reproduces its target scenario as a sub-second targeted assertion instead of a 30-minute CI timeout.
This PR is deliberately split from the fix (#2512) so the bug can be demonstrated on its own CI -- the red `TestStaleSourceRaceMissingAndRemove` on this PR and the green run on #2512 form the on-PR before/after proof. Production code: 0 LOC.
Changes
- `race_test.go` (+398 LOC) with three tests:
  - `TestStaleSourceRaceMissingAndRemove/{4k,hugepage}` -- plants a sentinel (`0xc3`) in the source page, parks the worker before `settleRequests.RLock()`, fires `madvise`, waits for the REMOVE batch to commit, releases the worker, asserts the page is zero-filled.
  - `TestNoMadviseDeadlockWithInflightCopy` -- liveness pin; fails if a future change accidentally couples `readEvents` to `settleRequests`.
  - `TestFaultedShortCircuitOrdering` -- pins the invariant that REMOVE batches drain before pagefault dispatch in every `Serve()` iteration.
- `testHandler` gains `installFaultBarrier` / `waitFaultHeld` / `releaseFault` wrappers around the harness RPCs; `testConfig` gains a `sourcePatcher` hook.
- `nolint:contextcheck` on the `executeAll` helper (per-op `t.Context()` is intentionally separate from the bounded race-wrapper ctx); `nolint:paralleltest,tparallel` on the short-circuit test (a gated handler keeps a goroutine suspended in the kernel pagefault path, so a STW GC pause from a parallel sibling would deadlock).
Stack
- `feat/uffd-remove-events-matrix` (feat(uffd): handle UFFD_EVENT_REMOVE; track per-page state; race-safe COPY #2520)
- `fix/uffd-stale-source-race` (fix(uffd): read page state inside worker under settleRequests.RLock #2512) -- turns this PR's failing tests green.
Test plan
- `go build ./...`
- `go vet ./pkg/sandbox/uffd/...`
- `golangci-lint run ./pkg/sandbox/uffd/userfaultfd/... ./pkg/sandbox/uffd/testutils/...`
- `sudo GOMAXPROCS=2 go test -race -timeout 15m -count=1 ./pkg/sandbox/uffd/userfaultfd/...` -- expected to fail on `TestStaleSourceRaceMissingAndRemove/{4k,hugepage}` with `page 1 first byte: want 0 (post-fix zero-fault for removed state), got 0xc3`. This PR's tests are expected to fail at this layer of the stack; fix(uffd): read page state inside worker under settleRequests.RLock #2512 makes them pass. All other tests pass.