test(uffd): rapid state-machine + chaos-source coverage #2544

Closed
ValentaTomas wants to merge 5 commits into feat/uffd-remove-events-matrix from test/uffd-rapid-and-chaos-coverage

Conversation

@ValentaTomas
Member

Stacks on feat/uffd-remove-events-matrix. Adds two seedable test suites: a property-based state-machine fuzzer (TestRapidStateMachine) and a latency-injecting teardown test (TestChaosCloseTerminatesUnderLatency). Will rebase onto refactor/uffd-test-child-owned-memory once it lands; that branch is the proper base because both tests apply t.Parallel() pressure, which is only safe with the syscall-based STW fix it carries.

Architecture note

Both new tests use a direct parent-side handler (directHandler) rather than the cross-process RPC harness. This means:

  • No child process spawned — the mmap, UFFD fd, and Serve goroutine all live in the test process.
  • Chaos injection works naturally (the chaos source is in the same process as Serve).
  • pageStateEntries() is called directly on *Userfaultfd without RPC round-trips.
  • t.Cleanup handles teardown; the close-based serveDone chan struct{} lets both the test body and t.Cleanup wait concurrently without consuming the channel.
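The close-based done channel mentioned in the last bullet can be sketched as follows. This is a minimal illustration, not the PR's directHandler: the `server` type, `startServer`, and `waitAll` are hypothetical names, and only the `serveDone chan struct{}` idea comes from the description. Because the channel is closed rather than sent on, any number of waiters can receive from it without one of them consuming the signal.

```go
package main

import (
	"fmt"
	"sync"
)

// server is a hypothetical stand-in for the in-process handler; only the
// done-channel pattern matters here.
type server struct {
	serveDone chan struct{} // closed (never sent on) when the serve loop returns
}

// startServer runs work in a background goroutine and closes serveDone
// when it returns.
func startServer(work func()) *server {
	s := &server{serveDone: make(chan struct{})}
	go func() {
		defer close(s.serveDone) // closing wakes every current and future waiter
		work()
	}()
	return s
}

// waitAll blocks n independent waiters on the same channel and reports how
// many of them observed completion.
func waitAll(s *server, n int) int {
	var (
		wg   sync.WaitGroup
		mu   sync.Mutex
		seen int
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			<-s.serveDone
			mu.Lock()
			seen++
			mu.Unlock()
		}()
	}
	wg.Wait()
	return seen
}

func main() {
	s := startServer(func() {})
	// Two waiters stand in for the test body and the t.Cleanup callback.
	fmt.Println(waitAll(s, 2))
	// A receive on an already-closed channel returns immediately.
	<-s.serveDone
}
```

A buffered channel with a single send would wake only one of the two waiters, which is why the close-based form suits a test body and a cleanup hook that both need to wait.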

TestRapidStateMachine

Property-based state-machine fuzzer using pgregory.net/rapid v1.3.0. Drives random sequences of read / write / madvise(MADV_DONTNEED) actions against a live handler; the model tracks per-page state (missing / faulted / zero-faulted / removed). The 4-state model correctly handles remove+re-fault pages that are zero-filled (not from source). Rapid shrinks any failing sequence to a minimal counterexample.
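A rough sketch of the 4-state model's transition function, written without rapid so it stands alone. The transitions for access on missing, faulted, and removed pages and for madvise on faulted pages follow the commit message; everything else (access on a zero-faulted page, madvise on a zero-faulted, missing, or removed page) is an assumption made here for completeness, and the `removeEnabled` mode is ignored. All identifiers are hypothetical.

```go
package main

import "fmt"

type pageState int

const (
	missing pageState = iota
	faulted
	zeroFaulted
	removed
)

var stateNames = [...]string{"missing", "faulted", "zero-faulted", "removed"}

type action int

const (
	access action = iota // a read or a write
	remove               // madvise(MADV_DONTNEED)
)

// next returns the model state after applying a to a page in state s.
func next(s pageState, a action) pageState {
	switch a {
	case access:
		switch s {
		case missing, faulted:
			return faulted // content must equal the source data
		case removed:
			return zeroFaulted // content must be zero-fill
		case zeroFaulted:
			return zeroFaulted // assumption: page stays resident and zeroed
		}
	case remove:
		switch s {
		case faulted:
			return removed
		case zeroFaulted:
			return removed // assumption: a resident page can be removed again
		}
		// assumption: madvise on a missing or removed page is a no-op
	}
	return s
}

// wantZero reports whether the page content should be zero-fill rather than
// source data.
func wantZero(s pageState) bool { return s == zeroFaulted }

func main() {
	s := missing
	for _, a := range []action{access, remove, access} {
		s = next(s, a)
		fmt.Println(stateNames[s], "zero-fill expected:", wantZero(s))
	}
}
```

Keeping zero-faulted distinct from faulted is what lets the content check expect zeros after a remove and re-fault instead of source data.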

Both pagesize arms (4 KiB, 2 MiB) × both removeEnabled modes run as parallel subtests via runMatrix.

Reproduce a failure:

RAPID_SEED=<seed-from-fail-output> go test -run='TestRapidStateMachine' ./pkg/sandbox/uffd/userfaultfd/...

TestChaosCloseTerminatesUnderLatency

Wraps block.Slicer with a chaosSource that injects uniform random [0, 50ms] latency per Slice call. Fires 64 concurrent MADV_POPULATE_READ goroutines, then triggers Shutdown and asserts teardown completes within 5s. Catches Close↔Serve drain ordering regressions where a slow worker would otherwise wedge wg.Wait().
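For readers unfamiliar with the wrapper pattern, a latency-injecting source could look roughly like this. The `slicer` interface is a hypothetical stand-in for block.Slicer (whose real signature is not shown in this PR), and `memSource`, `newChaosSource`, and `delay` are illustrative names. The only parts taken from the description are the uniform [0, 50ms] delay per Slice call and the seedability.

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
	"time"
)

// slicer is a hypothetical stand-in for block.Slicer; the real interface
// may take different arguments.
type slicer interface {
	Slice(off, length int64) ([]byte, error)
}

// chaosSource delays every Slice call by a seeded, uniformly random duration.
type chaosSource struct {
	inner slicer
	max   time.Duration

	mu  sync.Mutex
	rng *rand.Rand // *rand.Rand is not safe for concurrent use, so guard it
}

func newChaosSource(inner slicer, seed int64, max time.Duration) *chaosSource {
	return &chaosSource{inner: inner, max: max, rng: rand.New(rand.NewSource(seed))}
}

// delay draws a uniform duration in [0, max].
func (c *chaosSource) delay() time.Duration {
	c.mu.Lock()
	defer c.mu.Unlock()
	return time.Duration(c.rng.Int63n(int64(c.max) + 1))
}

// Slice sleeps for a random delay, then forwards to the wrapped source.
func (c *chaosSource) Slice(off, length int64) ([]byte, error) {
	time.Sleep(c.delay())
	return c.inner.Slice(off, length)
}

// memSource is a trivial in-memory source used only for this example.
type memSource []byte

func (m memSource) Slice(off, length int64) ([]byte, error) {
	if off < 0 || length < 0 || off+length > int64(len(m)) {
		return nil, fmt.Errorf("slice [%d, %d) out of range", off, off+length)
	}
	return m[off : off+length], nil
}

func main() {
	src := newChaosSource(memSource("0123456789"), 42, 50*time.Millisecond)
	b, err := src.Slice(2, 4) // sleeps somewhere in [0, 50ms] first
	fmt.Println(string(b), err)
}
```

With 64 concurrent callers, a fixed seed reproduces the sequence of drawn delays but not which goroutine receives which delay, so a seed narrows a failure down rather than replaying it exactly.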

After Serve returns, the UFFD fd is closed so any goroutines still blocked on unresolved faults receive EFAULT and unblock gracefully.

Both pagesize arms × both removeEnabled modes run as parallel subtests via runMatrix.

Reproduce a failure:

UFFD_CHAOS_SEED=<seed-from-fail-output> go test -run='TestChaosCloseTerminatesUnderLatency' ./pkg/sandbox/uffd/userfaultfd/...

Notes

  • pgregory.net/rapid is added as a direct test-only dep at v1.3.0; verified it does not appear in non-test builds (go list -deps ./... | grep pgregory returns nothing).
  • Both tests call t.Parallel() on all subtests. The direct in-process handler avoids the Go-runtime↔UFFD STW deadlock that plagued the cross-process tests (all page faults now happen via MADV_POPULATE_READ/WRITE in _Gsyscall, which is always at a GC safe point).
  • Both tests require root (UFFD_API syscall). Without root, they skip gracefully.
  • Local test run: sudo go test -race -count=3 -timeout=180s -run='TestRapidStateMachine|TestChaosCloseTerminatesUnderLatency' ./pkg/sandbox/uffd/userfaultfd/... — PASS (3.9s).

Upgrades the indirect pin of rapid from v1.2.0 to the direct dependency
v1.3.0 needed by the new TestRapidStateMachine property-based suite.
rapid does NOT appear in non-test builds (verified via go list -deps).

Adds TestRapidStateMachine: a pgregory.net/rapid property-based fuzzer
that drives random read / write / madvise(MADV_DONTNEED) sequences against
a live Userfaultfd handler running in-process (no child RPC server).

The state machine tracks 4 model states (missing / faulted / zero-faulted /
removed) and validates per-action invariants after each step:
- read/write on missing/faulted pages: state → faulted, content = source data
- read/write on removed pages: state → zero-faulted, content = zero fill
- madvise on faulted pages: state → removed (waits for REMOVE event drain)

Both pagesize arms (4 KiB, 2 MiB hugepage) × both removeEnabled modes run
as parallel subtests via runMatrix.

Also adds directHandler — a lightweight in-process UFFD harness used by both
the rapid and chaos tests. It creates the mmap, configures the fd, and runs
Userfaultfd.Serve in a background goroutine; t.Cleanup handles teardown.

Reproduce a failure:
  RAPID_SEED=<seed> go test -run=TestRapidStateMachine ./pkg/sandbox/uffd/userfaultfd/...

Wraps block.Slicer with a seedable chaosSource that injects uniform
random [0, 50ms] latency per Slice call, fires 64 concurrent
MADV_POPULATE_READ goroutines, triggers Shutdown, and asserts
teardown completes within 5 seconds.

Guards Close↔Serve drain regressions where a slow in-flight worker
would otherwise wedge the wg.Wait() drain in the Serve exit path.
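
The bounded-teardown assertion can be approximated with a small helper like the one below. It is a sketch under assumptions: `waitTimeout` is a hypothetical name, and the `wait` callback stands in for whatever blocks until Serve has returned (the real test's shutdown call is not reproduced here).

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// waitTimeout runs wait in a goroutine and reports whether it returned
// within d. On timeout the goroutine is left running, which is acceptable
// for a test that is about to fail anyway.
func waitTimeout(wait func(), d time.Duration) bool {
	done := make(chan struct{})
	go func() {
		defer close(done)
		wait()
	}()

	timer := time.NewTimer(d)
	defer timer.Stop()

	select {
	case <-done:
		return true
	case <-timer.C:
		return false
	}
}

func main() {
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		time.Sleep(10 * time.Millisecond) // a slow in-flight worker
	}()
	// wg.Wait stands in for the drain in the Serve exit path.
	fmt.Println(waitTimeout(wg.Wait, 5*time.Second))
}
```

If a worker never calls Done, wg.Wait blocks forever; the helper turns that hang into a false result after the deadline so the test can fail with a message instead of timing out the whole run.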

Both pagesize arms × both removeEnabled modes run as parallel subtests
via runMatrix. Seed is logged at test start; override with
UFFD_CHAOS_SEED=<n>.
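
One way the seed override could be resolved is sketched below. How the PR actually reads UFFD_CHAOS_SEED is not shown, so `resolveSeed`, its injectable `lookup` parameter, and the time-based fallback are all assumptions for illustration.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// resolveSeed returns the seed stored in the named variable when it is set,
// and fallback otherwise. lookup is injectable so the logic can be exercised
// without touching the process environment; pass os.LookupEnv in practice.
func resolveSeed(lookup func(string) (string, bool), name string, fallback int64) (seed int64, overridden bool, err error) {
	raw, ok := lookup(name)
	if !ok || raw == "" {
		return fallback, false, nil
	}
	seed, err = strconv.ParseInt(raw, 10, 64)
	if err != nil {
		return 0, false, fmt.Errorf("%s=%q is not a valid int64 seed: %w", name, raw, err)
	}
	return seed, true, nil
}

func main() {
	seed, overridden, err := resolveSeed(os.LookupEnv, "UFFD_CHAOS_SEED", time.Now().UnixNano())
	if err != nil {
		fmt.Println(err)
		os.Exit(1)
	}
	// Logging the seed up front is what makes a later failure reproducible.
	fmt.Printf("chaos seed=%d overridden=%t\n", seed, overridden)
}
```

Rejecting an unparseable value, rather than silently falling back to a fresh seed, avoids a reproduction run that quietly uses a different seed than the one requested.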

Reproduce a failure:
  UFFD_CHAOS_SEED=<seed> go test -run=TestChaosCloseTerminatesUnderLatency \
    ./pkg/sandbox/uffd/userfaultfd/...
@cursor

cursor Bot commented May 3, 2026

PR Summary

Low Risk
Adds new root-only userfaultfd tests (including property-based fuzzing and latency/teardown stress) plus a test dependency bump; production code paths are unchanged, but the new suites may increase CI flakiness or runtime if not properly gated.

Overview
Expands userfaultfd regression coverage by adding a seedable chaos test that injects random per-page I/O latency and asserts Serve reliably drains/shuts down under load, plus a property-based rapid state-machine test that fuzzes read/write/remove sequences and checks page-state and content invariants using an in-process handler harness. Updates Go module metadata to include pgregory.net/rapid (v1.3.0) across packages for these new tests.

Reviewed by Cursor Bugbot for commit c28fc1e. Bugbot is set up for automated code reviews on this repo. Configure here.

@ValentaTomas
Member Author

Folded into #2520. Rapid + chaos tests now use the existing cross-process testharness instead of an in-process handler.

@ValentaTomas ValentaTomas deleted the test/uffd-rapid-and-chaos-coverage branch May 3, 2026 03:58