Commit 68037d2

docs(uffd): note slice-retry-under-RLock latency in faultPage
Slice retries (up to ~2s of exponential backoff) run while the calling worker holds settleRequests.RLock, which delays REMOVE batch processing by the same amount. Correctness is unaffected (the uffd-ring FIFO still serialises any subsequent same-page fault), but if this latency ever shows up in production metrics, the documented fix lives here.
1 parent da5a856 commit 68037d2

1 file changed: 9 additions & 0 deletions

packages/orchestrator/pkg/sandbox/uffd/userfaultfd/userfaultfd.go
@@ -409,6 +409,15 @@ func (u *Userfaultfd) faultPage(
 	case source == nil && u.pageSize == header.HugepageSize:
 		writeErr = u.fd.copy(addr, u.pageSize, header.EmptyHugePage, mode)
 	default:
+		// NOTE: this slice retry runs while the calling worker holds
+		// settleRequests.RLock, which means a concurrent REMOVE batch
+		// (which needs the write lock) is blocked for the full retry
+		// budget — up to ~2s of exponential backoff. Correctness is
+		// fine: a delayed REMOVE batch still applies before any
+		// subsequent fault on the same page (uffd-ring FIFO). If the
+		// REMOVE-blocking latency ever shows up in metrics, the fix
+		// is to move Slice outside the lock and re-check pageTracker
+		// state after re-acquiring it before issuing UFFDIO_COPY.
 		var b []byte
 		var dataErr error
 		var attempt int
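
For concreteness, the locking pattern the new comment points at (move the retried Slice fetch outside settleRequests.RLock, then re-acquire the lock and re-validate the page before issuing UFFDIO_COPY) can be sketched as a small, self-contained Go program. Everything in this sketch is invented for illustration: tracker, fetchWithRetry, resolveFault and the removed map are placeholders, not the real userfaultfd package API.

package main

import (
	"fmt"
	"sync"
	"time"
)

// tracker stands in for the real page-tracking state guarded by
// settleRequests; removed marks pages that a REMOVE batch has settled.
type tracker struct {
	mu      sync.RWMutex
	removed map[uintptr]bool
}

// fetchWithRetry stands in for the Slice call with its exponential
// backoff; the point is that it runs with no lock held, so a REMOVE
// batch waiting on the write lock is not blocked for the retry budget.
func fetchWithRetry(addr uintptr) ([]byte, error) {
	time.Sleep(10 * time.Millisecond) // pretend this can take up to ~2s
	return make([]byte, 4096), nil
}

func (t *tracker) resolveFault(addr uintptr) error {
	// 1. Slow, retried fetch first, lock-free.
	b, err := fetchWithRetry(addr)
	if err != nil {
		return err
	}

	// 2. Re-acquire the read lock and re-check page state: a REMOVE
	//    batch may have settled this page while we were fetching.
	t.mu.RLock()
	defer t.mu.RUnlock()
	if t.removed[addr] {
		// Skip the write; a later fault on this page (serialised by
		// the uffd ring) will be resolved against the new state.
		return nil
	}

	// 3. Only now perform the write (UFFDIO_COPY in the real code).
	_ = b
	return nil
}

func main() {
	t := &tracker{removed: map[uintptr]bool{}}
	fmt.Println(t.resolveFault(0x1000))
}

The re-check in step 2 is what makes the reordering safe: once the fetch runs outside the lock, a REMOVE batch can complete in between, so the page state has to be validated again before any copy is issued.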
