Skip to content

test(uffd): add deterministic UFFD stale-source race tests#2521

Closed
ValentaTomas wants to merge 3 commits intofeat/uffd-remove-events-matrixfrom
test/uffd-stale-source-race-tests
Closed

test(uffd): add deterministic UFFD stale-source race tests#2521
ValentaTomas wants to merge 3 commits intofeat/uffd-remove-events-matrixfrom
test/uffd-stale-source-race-tests

Conversation

@ValentaTomas
Copy link
Copy Markdown
Member

@ValentaTomas ValentaTomas commented Apr 29, 2026

Summary

Add three deterministic race regression tests built on the RPC harness (#2519) and the REMOVE handling (#2520). Each reproduces its target scenario as a sub-second targeted assertion instead of a 30-minute CI timeout.

This PR is deliberately split from the fix (#2512) so the bug can be demonstrated on its own CI -- the red TestStaleSourceRaceMissingAndRemove on this PR and the green run on #2512 form the on-PR before/after proof. Production code: 0 LOC.

Changes

  • race_test.go (+398 LOC) with three tests:
    • TestStaleSourceRaceMissingAndRemove/{4k,hugepage} -- plants a sentinel (0xc3) in the source page, parks the worker before settleRequests.RLock(), fires madvise, waits for the REMOVE batch to commit, releases the worker, asserts the page is zero-filled.
    • TestNoMadviseDeadlockWithInflightCopy -- liveness pin; fails if a future change accidentally couples readEvents to settleRequests.
    • TestFaultedShortCircuitOrdering -- pins the invariant that REMOVE batches drain before pagefault dispatch in every Serve() iteration.
  • testHandler gains installFaultBarrier / waitFaultHeld / releaseFault wrappers around the harness RPCs; testConfig gains a sourcePatcher hook.
  • Inline lint suppressions with rationale: nolint:contextcheck on the executeAll helper (per-op t.Context() is intentionally separate from the bounded race-wrapper ctx); nolint:paralleltest,tparallel on the short-circuit test (a gated handler keeps a goroutine suspended in the kernel pagefault path, so a STW GC pause from a parallel sibling would deadlock).

Stack

Test plan

  • go build ./...
  • go vet ./pkg/sandbox/uffd/...
  • golangci-lint run ./pkg/sandbox/uffd/userfaultfd/... ./pkg/sandbox/uffd/testutils/...
  • sudo GOMAXPROCS=2 go test -race -timeout 15m -count=1 ./pkg/sandbox/uffd/userfaultfd/... -- expected to fail on TestStaleSourceRaceMissingAndRemove/{4k,hugepage} with page 1 first byte: want 0 (post-fix zero-fault for removed state), got 0xc3. This PR's tests are expected to fail at this layer of the stack; fix(uffd): read page state inside worker under settleRequests.RLock #2512 makes them pass. All other tests pass.

@cursor
Copy link
Copy Markdown

cursor Bot commented Apr 29, 2026

PR Summary

Medium Risk
Test-only changes but they exercise timing-sensitive cross-process userfaultfd behavior and introduce intentionally failing coverage for a known race, which can affect CI stability and runtime if flaky or mis-tuned.

Overview
Adds a new set of deterministic race regression tests for userfaultfd REMOVE/pagefault interactions, including a stale-source reproduction, a MADV_DONTNEED liveness/deadlock guard, and an ordering sanity check under paused/resumed serving with tight time budgets. The cross-process test harness is extended to support per-test patching of the generated source buffer and to expose barrier install/wait/release helpers (with context cancellation) so tests can reliably park and resume fault handlers at specific phases.

Reviewed by Cursor Bugbot for commit 4e7a838. Bugbot is set up for automated code reviews on this repo. Configure here.

Comment thread packages/orchestrator/pkg/sandbox/uffd/userfaultfd/race_test.go
Comment thread packages/orchestrator/pkg/sandbox/uffd/userfaultfd/race_test.go
Comment thread packages/orchestrator/pkg/sandbox/uffd/userfaultfd/race_test.go
Comment thread packages/orchestrator/pkg/sandbox/uffd/userfaultfd/race_test.go
Comment thread packages/orchestrator/pkg/sandbox/uffd/userfaultfd/race_test.go Outdated
Comment thread packages/orchestrator/pkg/sandbox/uffd/userfaultfd/race_test.go Outdated
Copy link
Copy Markdown
Contributor

@dobrac dobrac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets resolve the comments please 🙏

@dobrac dobrac marked this pull request as draft April 29, 2026 20:03
Copy link
Copy Markdown

@linearb linearb Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✨ PR Review

LGTM

Generated by LinearB AI and added by gitStream.
AI-generated content may contain inaccuracies. Please verify before using.
💡 Tip: You can customize your AI Review using Guidelines Learn how

ValentaTomas added a commit that referenced this pull request May 1, 2026
Replace the two test-only func fields beforeWorkerRLockHook and
beforeFaultPageHook on Userfaultfd with a single atomic.Pointer[testHooks]
holding a struct of optional callbacks. Lets PR #2521's race tests add a
new hook without touching the production struct, and shrinks the
test-surface comment from 17 lines to 3.

No production behavior change; hot-path cost is identical (one atomic
load + nil check vs one nil-check per hook today).
@ValentaTomas ValentaTomas force-pushed the feat/uffd-remove-events-matrix branch from da91230 to efa01a5 Compare May 2, 2026 08:48
@ValentaTomas ValentaTomas force-pushed the test/uffd-stale-source-race-tests branch from e744f88 to f8f9d6e Compare May 2, 2026 08:59
@ValentaTomas ValentaTomas changed the title test(uffd): add deterministic stale-source / madvise-deadlock / faulted-short-circuit race tests test(uffd): add deterministic UFFD stale-source race tests May 2, 2026
@ValentaTomas ValentaTomas force-pushed the test/uffd-stale-source-race-tests branch from f8f9d6e to 1e978ee Compare May 2, 2026 10:20
…ed-short-circuit race tests

Three race tests built on the unix-socket RPC harness and the test-only
fault-barrier hooks. None use sleeps, retries, or soak loops - each
test installs explicit barriers on the child's worker goroutine, drives
the racing kernel operation from the parent, and asserts on a concrete
post-state.

  - TestStaleSourceRaceMissingAndRemove: regression test for the
    stale-source bug. Plants a non-zero sentinel into the source page,
    parks the worker via barrierBeforeRLock, fires madvise, waits for
    the REMOVE batch to commit, releases the worker, then asserts the
    page is zero-filled. INTENTIONALLY FAILS on this PR with
    `page 1 first byte: want 0 ... got 0xc3` - the worker captured
    `source = u.src` in the parent loop before the REMOVE landed and
    UFFDIO_COPY'd the planted sentinel into the page after the kernel
    had MADV_DONTNEED'd it. PR #4 (#2512) makes this pass by re-reading
    state inside the worker under settleRequests.RLock.

  - TestNoMadviseDeadlockWithInflightCopy: liveness regression test.
    Parks the worker via barrierBeforeFaultPage (holding RLock), fires
    madvise, asserts madvise returns within 2s. Passes today; protects
    against any future change that accidentally couples readEvents to
    settleRequests.

  - TestFaultedShortCircuitOrdering: smoke test on the REMOVE-then-
    pagefault batch ordering using the gated harness. Pins the
    invariant that REMOVE batches drain before pagefault dispatch in
    a single Serve iteration.

Test infrastructure additions:
  - testHandler.installFaultBarrier / waitFaultHeld / releaseFault
    convenience wrappers around the Service.* RPCs from PR #1.
  - testConfig.sourcePatcher hook so race tests can plant a
    deterministic sentinel into the random source data BEFORE the
    content file is written, without depending on the happenstance
    value of any randomly-generated byte.

ALL OTHER TESTS in the package still pass on this PR; only the three
sub-tests of TestStaleSourceRaceMissingAndRemove fail (the bug
demonstration).
- waitForState: add default case to avoid silent busy-poll on
  unrecognised pageState values.
- TestFaultedShortCircuitOrdering: rewrite docstring to accurately
  describe coverage (disjoint-page end-state check, not an ordering
  invariant guard; same-page ordering is covered by TestStaleSource...).
- TestStaleSourceRaceMissingAndRemove: fix "MISSING-write fault" stale
  docstring to "MISSING (READ) fault", note both variants fail until #2512.
- Trim verbose multi-line constant and helper comments down to
  load-bearing WHY.
@ValentaTomas ValentaTomas force-pushed the test/uffd-stale-source-race-tests branch from 1e978ee to 3047e69 Compare May 2, 2026 10:37
ValentaTomas added a commit that referenced this pull request May 2, 2026
…onrpc over unix socket (#2519)

Replace the cross-process userfaultfd test harness's pipes + signals
(`SIGUSR1` shutdown, `SIGUSR2` page-state snapshot,
ready/offsets/gate-cmd/gate-sync pipes) with one Unix socket carrying
stdlib `net/rpc` + `net/rpc/jsonrpc`. The userfaultfd and the rpc
socketpair half are passed via `ExtraFiles`.

Production change: one `atomic.Pointer[func(uintptr, faultPhase)]` field
on `Userfaultfd` and three nil-checked inline call sites. Test builds
install the hook via `SetTestFaultHook` defined in a `_test.go` file.

Stacked follow-ups:

- `UFFD_EVENT_REMOVE` handling + matrix tests — #2520
- Stale-source / madvise-deadlock / faulted-short-circuit race tests —
#2521
- Stale-source race fix — #2512
@ValentaTomas
Copy link
Copy Markdown
Member Author

Folded into #2520 — race-test demonstration and fix now live alongside the REMOVE matrix tests so the stack is just #2519#2520.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants