Skip to content

fix(uffd): read page state inside worker under settleRequests.RLock#2512

Closed
ValentaTomas wants to merge 2 commits intotest/uffd-stale-source-race-testsfrom
fix/uffd-stale-source-race
Closed

fix(uffd): read page state inside worker under settleRequests.RLock#2512
ValentaTomas wants to merge 2 commits intotest/uffd-stale-source-race-testsfrom
fix/uffd-stale-source-race

Conversation

@ValentaTomas
Copy link
Copy Markdown
Member

@ValentaTomas ValentaTomas commented Apr 28, 2026

Summary

Move the pageTracker.get(addr) switch and the source = u.src capture from the parent Serve() loop into the worker goroutine, after settleRequests.RLock(). Read + act + commit is now atomic with respect to the REMOVE batch.

This PR is deliberately split from #2521 so each PR has a single concern: #2521 lands the deterministic regression tests (the red CI is the bug demo); this PR lands the one-file production fix on top and turns those tests green.

Changes

  • userfaultfd.go only (+35 / -27). Conceptually three lines moved and one continue turned into return nil to fit the goroutine closure.

Root cause

Pre-fix, a UFFD_EVENT_REMOVE arriving between the parent loop reading pageTracker.get(addr) / capturing source = u.src and the worker getting onto a CPU would commit removed under settleRequests.Lock() first. The worker would then UFFDIO_COPY the stale source bytes into the page the kernel had just MADV_DONTNEED'd and overwrite removed with faulted. User-visible symptoms: orchestrator parent madvise() blocking forever and pages reappearing with stale src bytes after a remove.

Stack

Test plan

  • go build ./...
  • go vet ./pkg/sandbox/uffd/...
  • golangci-lint run ./pkg/sandbox/uffd/userfaultfd/... ./pkg/sandbox/uffd/testutils/...
  • sudo GOMAXPROCS=2 go test -race -timeout 15m -count=1 ./pkg/sandbox/uffd/userfaultfd/... -- all race tests added by test(uffd): add deterministic UFFD stale-source race tests #2521 now pass, including TestStaleSourceRaceMissingAndRemove/{4k,hugepage}.
  • Soak: sudo go test -count=20 -timeout=30s -run 'TestStaleSourceRaceMissingAndRemove|TestNoMadviseDeadlockWithInflightCopy|TestFaultedShortCircuitOrdering' ./pkg/sandbox/uffd/userfaultfd/... -- deterministic pass.

@cursor
Copy link
Copy Markdown

cursor Bot commented Apr 28, 2026

PR Summary

High Risk
Changes synchronization around userfaultfd page state transitions in the orchestrator sandbox, which is concurrency- and memory-safety-critical. A mistake here could reintroduce deadlocks or stale page mappings under load.

Overview
Fixes a race in userfaultfd.Serve() by moving the per-page pageTracker state check (and deciding whether to copy from the source vs zero-fill) into the worker goroutine after settleRequests.RLock(), so pagefault handling can’t overwrite a concurrent REMOVE batch’s removed state. Also adds an ARM64 CI step to print runner/kernel/hugepage details for orchestrator test jobs to aid debugging environment-specific failures.

Reviewed by Cursor Bugbot for commit a794b82. Bugbot is set up for automated code reviews on this repo. Configure here.

ValentaTomas added a commit that referenced this pull request Apr 28, 2026
…e from feat/free-page-reporting

Lift the production-side UFFD subsystem closure that the upcoming RPC
test harness and race tests depend on, taken verbatim from the tip of
feat/free-page-reporting (commit f310273). The closure brings in:

  - pageTracker gains the `removed` pageState
  - Userfaultfd.Serve() splits readEvents into removes + pagefaults,
    drains the REMOVE batch under settleRequests.Lock(), then dispatches
    the pagefault batch
  - Worker fault dispatch with switch on pageTracker state (note: the
    state-snapshot read happens in the parent loop here — that is the
    bug the stacked PR #2512 fixes)
  - wakeupPipe self-pipe to wake the poll loop when a goroutine defers a
    page fault
  - defaultCopyMode override, MapMemoryReadyAddr lifecycle bits
  - deferred.go (deferred pagefault list)
  - prefault.go updates for prefetch / WP coordination
  - remove_test.go (REMOVE-event integration tests)
  - cross_process_helpers_test.go / helpers_test.go / fd_helpers_test.go
    REMOVE-aware variants

This commit is purely a state lift of
packages/orchestrator/pkg/sandbox/uffd/userfaultfd/ — no other paths
touched, no Firecracker bumps, no feature flags, no proto regen, no
template-manager / API plumbing.

Required so that test/uffd-rpc-harness-and-race-tests can target main
(via feat/uffd-test-scaffolding) instead of being stacked on top of the
full free-page-reporting feature PR (#1896).
@ValentaTomas ValentaTomas force-pushed the test/uffd-rpc-harness-and-race-tests branch from 40be150 to 763945a Compare April 28, 2026 04:51
@ValentaTomas ValentaTomas force-pushed the fix/uffd-stale-source-race branch from 51664dd to 562b16b Compare April 28, 2026 04:52
@ValentaTomas ValentaTomas force-pushed the fix/uffd-stale-source-race branch from 562b16b to 7ad9154 Compare April 29, 2026 00:28
@ValentaTomas ValentaTomas changed the base branch from test/uffd-rpc-harness-and-race-tests to test/uffd-stale-source-race-tests April 29, 2026 00:29
@ValentaTomas ValentaTomas marked this pull request as draft May 1, 2026 18:04
ValentaTomas added a commit that referenced this pull request May 2, 2026
…trix-mode tests

Production:
  - pageState gains `removed`; pageTracker gains `get(addr)`.
  - Userfaultfd.Serve() splits readEvents into removes + pagefaults,
    drains the REMOVE batch under settleRequests.Lock(), then
    dispatches the pagefault batch.
  - Worker dispatch switches on pageTracker.get(addr): faulted ->
    short-circuit, removed -> zero-fill (source = nil), missing ->
    copy from u.src. NOTE: the state read and `source` capture
    happen in the parent loop BEFORE the worker takes
    settleRequests.RLock(). This is the buggy shape that PR #4
    fixes; PR #3 adds the deterministic race tests that catch it.
  - faultPage gains zero-fill paths for source == nil (4K read =
    DONTWAKE zero + WP + wake; 4K write = zero + wake; hugepage =
    copy(EmptyHugePage)) and returns (handled bool, err error) so
    the worker can defer EAGAIN-on-COPY-during-REMOVE faults.
  - wakeupPipe + deferredFaults wake the poll loop when a worker
    defers a fault.
  - Prefault path checks pageTracker for faulted || removed and
    short-circuits.

Tests:
  - testConfig gains `removeEnabled bool`; configureApi optionally
    enables UFFD_FEATURE_EVENT_REMOVE based on it; the parent
    cleanup unregisters the UFFD region when REMOVE is on so munmap
    doesn't block on un-acked events.
  - PageStates RPC + handlerPageStates now expose `removed`.
  - operationModeRemove + executeRemove (madvise MADV_DONTNEED).
  - runMatrix(t, tt, body) wraps every existing generic test in two
    parallel subtests: remove-off (regression for the no-REMOVE
    path that production templates still use) and remove-on
    (covers the new code path).
  - remove_test.go: REMOVE-specific TestRemove, TestRemoveThenFault,
    TestRemoveThenWriteGated, TestWriteThenRemoveGated. Gated tests
    are nolint'd as serialised - a paused gated handler keeps a
    faulting goroutine suspended in the kernel pagefault path; a
    STW GC pause from a parallel test would wait forever for that
    goroutine to reach a safe point.

Out of scope (lives in stacked PRs):
  - Race tests demonstrating the stale-source bug -> PR #3.
  - The fix moving state read into the worker -> PR #4 (#2512).
ValentaTomas added a commit that referenced this pull request May 2, 2026
…ed-short-circuit race tests

Three race tests built on the unix-socket RPC harness and the test-only
fault-barrier hooks. None use sleeps, retries, or soak loops - each
test installs explicit barriers on the child's worker goroutine, drives
the racing kernel operation from the parent, and asserts on a concrete
post-state.

  - TestStaleSourceRaceMissingAndRemove: regression test for the
    stale-source bug. Plants a non-zero sentinel into the source page,
    parks the worker via barrierBeforeRLock, fires madvise, waits for
    the REMOVE batch to commit, releases the worker, then asserts the
    page is zero-filled. INTENTIONALLY FAILS on this PR with
    `page 1 first byte: want 0 ... got 0xc3` - the worker captured
    `source = u.src` in the parent loop before the REMOVE landed and
    UFFDIO_COPY'd the planted sentinel into the page after the kernel
    had MADV_DONTNEED'd it. PR #4 (#2512) makes this pass by re-reading
    state inside the worker under settleRequests.RLock.

  - TestNoMadviseDeadlockWithInflightCopy: liveness regression test.
    Parks the worker via barrierBeforeFaultPage (holding RLock), fires
    madvise, asserts madvise returns within 2s. Passes today; protects
    against any future change that accidentally couples readEvents to
    settleRequests.

  - TestFaultedShortCircuitOrdering: smoke test on the REMOVE-then-
    pagefault batch ordering using the gated harness. Pins the
    invariant that REMOVE batches drain before pagefault dispatch in
    a single Serve iteration.

Test infrastructure additions:
  - testHandler.installFaultBarrier / waitFaultHeld / releaseFault
    convenience wrappers around the Service.* RPCs from PR #1.
  - testConfig.sourcePatcher hook so race tests can plant a
    deterministic sentinel into the random source data BEFORE the
    content file is written, without depending on the happenstance
    value of any randomly-generated byte.

ALL OTHER TESTS in the package still pass on this PR; only the three
sub-tests of TestStaleSourceRaceMissingAndRemove fail (the bug
demonstration).
@ValentaTomas ValentaTomas force-pushed the test/uffd-stale-source-race-tests branch from e744f88 to f8f9d6e Compare May 2, 2026 08:59
@ValentaTomas ValentaTomas force-pushed the fix/uffd-stale-source-race branch from 7ad9154 to 84b9c5d Compare May 2, 2026 09:06
@ValentaTomas ValentaTomas changed the title fix(uffd): re-read page state inside worker goroutine under settleRequests.RLock fix(uffd): read page state inside worker under settleRequests.RLock May 2, 2026
ValentaTomas added a commit that referenced this pull request May 2, 2026
…ed-short-circuit race tests

Three race tests built on the unix-socket RPC harness and the test-only
fault-barrier hooks. None use sleeps, retries, or soak loops - each
test installs explicit barriers on the child's worker goroutine, drives
the racing kernel operation from the parent, and asserts on a concrete
post-state.

  - TestStaleSourceRaceMissingAndRemove: regression test for the
    stale-source bug. Plants a non-zero sentinel into the source page,
    parks the worker via barrierBeforeRLock, fires madvise, waits for
    the REMOVE batch to commit, releases the worker, then asserts the
    page is zero-filled. INTENTIONALLY FAILS on this PR with
    `page 1 first byte: want 0 ... got 0xc3` - the worker captured
    `source = u.src` in the parent loop before the REMOVE landed and
    UFFDIO_COPY'd the planted sentinel into the page after the kernel
    had MADV_DONTNEED'd it. PR #4 (#2512) makes this pass by re-reading
    state inside the worker under settleRequests.RLock.

  - TestNoMadviseDeadlockWithInflightCopy: liveness regression test.
    Parks the worker via barrierBeforeFaultPage (holding RLock), fires
    madvise, asserts madvise returns within 2s. Passes today; protects
    against any future change that accidentally couples readEvents to
    settleRequests.

  - TestFaultedShortCircuitOrdering: smoke test on the REMOVE-then-
    pagefault batch ordering using the gated harness. Pins the
    invariant that REMOVE batches drain before pagefault dispatch in
    a single Serve iteration.

Test infrastructure additions:
  - testHandler.installFaultBarrier / waitFaultHeld / releaseFault
    convenience wrappers around the Service.* RPCs from PR #1.
  - testConfig.sourcePatcher hook so race tests can plant a
    deterministic sentinel into the random source data BEFORE the
    content file is written, without depending on the happenstance
    value of any randomly-generated byte.

ALL OTHER TESTS in the package still pass on this PR; only the three
sub-tests of TestStaleSourceRaceMissingAndRemove fail (the bug
demonstration).
ValentaTomas added a commit that referenced this pull request May 2, 2026
- waitForState: add default case to avoid silent busy-poll on
  unrecognised pageState values.
- TestFaultedShortCircuitOrdering: rewrite docstring to accurately
  describe coverage (disjoint-page end-state check, not an ordering
  invariant guard; same-page ordering is covered by TestStaleSource...).
- TestStaleSourceRaceMissingAndRemove: fix "MISSING-write fault" stale
  docstring to "MISSING (READ) fault", note both variants fail until #2512.
- Trim verbose multi-line constant and helper comments down to
  load-bearing WHY.
@ValentaTomas ValentaTomas force-pushed the test/uffd-stale-source-race-tests branch from f8f9d6e to 1e978ee Compare May 2, 2026 10:20
@ValentaTomas ValentaTomas force-pushed the fix/uffd-stale-source-race branch from 84b9c5d to e5274b5 Compare May 2, 2026 10:27
ValentaTomas added a commit that referenced this pull request May 2, 2026
…ed-short-circuit race tests

Three race tests built on the unix-socket RPC harness and the test-only
fault-barrier hooks. None use sleeps, retries, or soak loops - each
test installs explicit barriers on the child's worker goroutine, drives
the racing kernel operation from the parent, and asserts on a concrete
post-state.

  - TestStaleSourceRaceMissingAndRemove: regression test for the
    stale-source bug. Plants a non-zero sentinel into the source page,
    parks the worker via barrierBeforeRLock, fires madvise, waits for
    the REMOVE batch to commit, releases the worker, then asserts the
    page is zero-filled. INTENTIONALLY FAILS on this PR with
    `page 1 first byte: want 0 ... got 0xc3` - the worker captured
    `source = u.src` in the parent loop before the REMOVE landed and
    UFFDIO_COPY'd the planted sentinel into the page after the kernel
    had MADV_DONTNEED'd it. PR #4 (#2512) makes this pass by re-reading
    state inside the worker under settleRequests.RLock.

  - TestNoMadviseDeadlockWithInflightCopy: liveness regression test.
    Parks the worker via barrierBeforeFaultPage (holding RLock), fires
    madvise, asserts madvise returns within 2s. Passes today; protects
    against any future change that accidentally couples readEvents to
    settleRequests.

  - TestFaultedShortCircuitOrdering: smoke test on the REMOVE-then-
    pagefault batch ordering using the gated harness. Pins the
    invariant that REMOVE batches drain before pagefault dispatch in
    a single Serve iteration.

Test infrastructure additions:
  - testHandler.installFaultBarrier / waitFaultHeld / releaseFault
    convenience wrappers around the Service.* RPCs from PR #1.
  - testConfig.sourcePatcher hook so race tests can plant a
    deterministic sentinel into the random source data BEFORE the
    content file is written, without depending on the happenstance
    value of any randomly-generated byte.

ALL OTHER TESTS in the package still pass on this PR; only the three
sub-tests of TestStaleSourceRaceMissingAndRemove fail (the bug
demonstration).
ValentaTomas added a commit that referenced this pull request May 2, 2026
- waitForState: add default case to avoid silent busy-poll on
  unrecognised pageState values.
- TestFaultedShortCircuitOrdering: rewrite docstring to accurately
  describe coverage (disjoint-page end-state check, not an ordering
  invariant guard; same-page ordering is covered by TestStaleSource...).
- TestStaleSourceRaceMissingAndRemove: fix "MISSING-write fault" stale
  docstring to "MISSING (READ) fault", note both variants fail until #2512.
- Trim verbose multi-line constant and helper comments down to
  load-bearing WHY.
@ValentaTomas ValentaTomas force-pushed the test/uffd-stale-source-race-tests branch from 1e978ee to 3047e69 Compare May 2, 2026 10:37
@ValentaTomas ValentaTomas force-pushed the fix/uffd-stale-source-race branch from e5274b5 to aa19fdc Compare May 2, 2026 10:37
…uests.RLock

The Serve() loop previously read pageTracker state and captured
`source = u.src` in the parent loop, then dispatched a worker
goroutine. A REMOVE event for the same page that arrived between the
state read and the worker actually acquiring settleRequests.RLock()
would silently leave the worker with a stale `source = u.src`
snapshot. The worker would then UFFDIO_COPY src bytes into a page the
kernel had just MADV_DONTNEED'd, leaving pageTracker == removed and
the kernel page mapped with stale src data — and observably
deadlocking parent madvise() in the orchestrator unit-test suite.

Move the state lookup and source capture inside the goroutine, after
RLock(). The read+act+commit sequence is now atomic with respect to
the REMOVE batch (which takes settleRequests.Lock()).

Makes the three deterministic race tests added in the parent PR pass:
  - TestStaleSourceRaceMissingAndRemove (the one that intentionally
    failed on the parent PR with `page 1 first byte: want 0 ... got 0xc3`)
  - TestNoMadviseDeadlockWithInflightCopy (already passed; now stays green)
  - TestFaultedShortCircuitOrdering (already passed; now stays green)

Soak: -count=20 -timeout=30s passes deterministically on this branch.
@ValentaTomas ValentaTomas force-pushed the fix/uffd-stale-source-race branch from aa19fdc to 26be119 Compare May 2, 2026 10:55
ValentaTomas added a commit that referenced this pull request May 2, 2026
…onrpc over unix socket (#2519)

Replace the cross-process userfaultfd test harness's pipes + signals
(`SIGUSR1` shutdown, `SIGUSR2` page-state snapshot,
ready/offsets/gate-cmd/gate-sync pipes) with one Unix socket carrying
stdlib `net/rpc` + `net/rpc/jsonrpc`. The userfaultfd and the rpc
socketpair half are passed via `ExtraFiles`.

Production change: one `atomic.Pointer[func(uintptr, faultPhase)]` field
on `Userfaultfd` and three nil-checked inline call sites. Test builds
install the hook via `SetTestFaultHook` defined in a `_test.go` file.

Stacked follow-ups:

- `UFFD_EVENT_REMOVE` handling + matrix tests — #2520
- Stale-source / madvise-deadlock / faulted-short-circuit race tests —
#2521
- Stale-source race fix — #2512
ValentaTomas added a commit that referenced this pull request May 3, 2026
…ed-short-circuit race tests

Three race tests built on the unix-socket RPC harness and the test-only
fault-barrier hooks. None use sleeps, retries, or soak loops - each
test installs explicit barriers on the child's worker goroutine, drives
the racing kernel operation from the parent, and asserts on a concrete
post-state.

  - TestStaleSourceRaceMissingAndRemove: regression test for the
    stale-source bug. Plants a non-zero sentinel into the source page,
    parks the worker via barrierBeforeRLock, fires madvise, waits for
    the REMOVE batch to commit, releases the worker, then asserts the
    page is zero-filled. INTENTIONALLY FAILS on this PR with
    `page 1 first byte: want 0 ... got 0xc3` - the worker captured
    `source = u.src` in the parent loop before the REMOVE landed and
    UFFDIO_COPY'd the planted sentinel into the page after the kernel
    had MADV_DONTNEED'd it. PR #4 (#2512) makes this pass by re-reading
    state inside the worker under settleRequests.RLock.

  - TestNoMadviseDeadlockWithInflightCopy: liveness regression test.
    Parks the worker via barrierBeforeFaultPage (holding RLock), fires
    madvise, asserts madvise returns within 2s. Passes today; protects
    against any future change that accidentally couples readEvents to
    settleRequests.

  - TestFaultedShortCircuitOrdering: smoke test on the REMOVE-then-
    pagefault batch ordering using the gated harness. Pins the
    invariant that REMOVE batches drain before pagefault dispatch in
    a single Serve iteration.

Test infrastructure additions:
  - testHandler.installFaultBarrier / waitFaultHeld / releaseFault
    convenience wrappers around the Service.* RPCs from PR #1.
  - testConfig.sourcePatcher hook so race tests can plant a
    deterministic sentinel into the random source data BEFORE the
    content file is written, without depending on the happenstance
    value of any randomly-generated byte.

ALL OTHER TESTS in the package still pass on this PR; only the three
sub-tests of TestStaleSourceRaceMissingAndRemove fail (the bug
demonstration).
ValentaTomas added a commit that referenced this pull request May 3, 2026
- waitForState: add default case to avoid silent busy-poll on
  unrecognised pageState values.
- TestFaultedShortCircuitOrdering: rewrite docstring to accurately
  describe coverage (disjoint-page end-state check, not an ordering
  invariant guard; same-page ordering is covered by TestStaleSource...).
- TestStaleSourceRaceMissingAndRemove: fix "MISSING-write fault" stale
  docstring to "MISSING (READ) fault", note both variants fail until #2512.
- Trim verbose multi-line constant and helper comments down to
  load-bearing WHY.
@ValentaTomas
Copy link
Copy Markdown
Member Author

Folded into #2520 — race-test demonstration and fix now live alongside the REMOVE matrix tests so the stack is just #2519#2520.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants