Commit 68037d2

docs(uffd): note slice-retry-under-RLock latency in faultPage
Slice retries (up to ~2s of exponential backoff) run while the calling worker holds settleRequests.RLock, which delays REMOVE batch processing by the same amount. Correctness is unaffected (the uffd-ring FIFO still serialises any subsequent same-page fault), but if this latency ever shows up in production metrics, the documented fix lives here.
1 parent da5a856 commit 68037d2

1 file changed: 9 additions & 0 deletions

packages/orchestrator/pkg/sandbox/uffd/userfaultfd/userfaultfd.go
@@ -409,6 +409,15 @@ func (u *Userfaultfd) faultPage(
 	case source == nil && u.pageSize == header.HugepageSize:
 		writeErr = u.fd.copy(addr, u.pageSize, header.EmptyHugePage, mode)
 	default:
+		// NOTE: this slice retry runs while the calling worker holds
+		// settleRequests.RLock, which means a concurrent REMOVE batch
+		// (which needs the write lock) is blocked for the full retry
+		// budget — up to ~2s of exponential backoff. Correctness is
+		// fine: a delayed REMOVE batch still applies before any
+		// subsequent fault on the same page (uffd-ring FIFO). If the
+		// REMOVE-blocking latency ever shows up in metrics, the fix
+		// is to move Slice outside the lock and re-check pageTracker
+		// state after re-acquiring it before issuing UFFDIO_COPY.
 		var b []byte
 		var dataErr error
 		var attempt int
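
For concreteness, the locking pattern the new comment points at (move the retried Slice fetch outside settleRequests.RLock, then re-acquire the lock and re-validate the page before issuing UFFDIO_COPY) can be sketched as a small, self-contained Go program. Everything in this sketch is invented for illustration: tracker, fetchWithRetry, resolveFault and the removed map are placeholders, not the real userfaultfd package API.

package main

import (
	"fmt"
	"sync"
	"time"
)

// tracker stands in for the real page-tracking state guarded by
// settleRequests; removed marks pages that a REMOVE batch has settled.
type tracker struct {
	mu      sync.RWMutex
	removed map[uintptr]bool
}

// fetchWithRetry stands in for the Slice call with its exponential
// backoff; the point is that it runs with no lock held, so a REMOVE
// batch waiting on the write lock is not blocked for the retry budget.
func fetchWithRetry(addr uintptr) ([]byte, error) {
	time.Sleep(10 * time.Millisecond) // pretend this can take up to ~2s
	return make([]byte, 4096), nil
}

func (t *tracker) resolveFault(addr uintptr) error {
	// 1. Slow, retried fetch first, lock-free.
	b, err := fetchWithRetry(addr)
	if err != nil {
		return err
	}

	// 2. Re-acquire the read lock and re-check page state: a REMOVE
	//    batch may have settled this page while we were fetching.
	t.mu.RLock()
	defer t.mu.RUnlock()
	if t.removed[addr] {
		// Skip the write; a later fault on this page (serialised by
		// the uffd ring) will be resolved against the new state.
		return nil
	}

	// 3. Only now perform the write (UFFDIO_COPY in the real code).
	_ = b
	return nil
}

func main() {
	t := &tracker{removed: map[uintptr]bool{}}
	fmt.Println(t.resolveFault(0x1000))
}

The re-check in step 2 is what makes the reordering safe: once the fetch runs outside the lock, a REMOVE batch can complete in between, so the page state has to be validated again before any copy is issued.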
