Rework memory reclamation API; reduce memory usage#3118
The previous 2 GiB/worker estimate ignored CUDA context and library overhead plus peak allocations inside individual tests, letting the cap float up to 64+ workers on large systems. A 4 GiB budget accounts for ~1 GiB baseline (context + loaded libraries) and leaves room for multi-GiB peak test allocations. Also reset the device before querying free memory so the budget reflects true capacity, and `@info`-log the computed budget (device, free memory, gpu_jobs, cpu_threads, cpu_free) so users can see why the cap landed where it did.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
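The budgeting logic above can be sketched as follows. This is a hedged illustration, not the PR's actual code: the helper name `gpu_worker_budget` is hypothetical, while the 4 GiB constant and the logged fields follow the commit message.

```julia
using CUDA

function gpu_worker_budget(; budget = 4 * 2^30)  # 4 GiB per worker
    device_reset!()                          # reset so free memory reflects true capacity
    free_gpu = CUDA.available_memory()       # bytes free on the current device
    gpu_jobs = max(1, Int(free_gpu ÷ budget))
    cpu_threads = Sys.CPU_THREADS
    cpu_free = Sys.free_memory()
    @info "Test worker budget" device=CUDA.device() free_gpu gpu_jobs cpu_threads cpu_free
    return gpu_jobs
end
```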
Two related changes to reclaim GPU memory that was effectively leaking:
1. Refactor cuSOLVER state to use an object-bound finalizer
The fat cusolverDnHandle (and analogous sparse/mg wrappers) held a
`workspace_gpu::CuVector{UInt8}` that can grow to hundreds of MiB. Its
finalizer was attached to `current_task()`, which on test workers never
dies — so `CUDA.reclaim()` could empty idle handle caches but couldn't
touch the pinned workspace. Probing showed ~200 MiB of pool memory per
cuSOLVER-using worker sitting there permanently.
Wrap the raw handles in `mutable struct` and attach the finalizer to
the wrapper instead. Add a `pre_reclaim_hooks` mechanism in CUDACore
so libraries can drop their TLS state on reclaim; `reclaim()` then
runs these hooks + GC + regular hooks, so finalizers fire and the
underlying buffers are released before the pool is trimmed. cuSOLVER
registers a hook that clears its dense/sparse/mg TLS entries.
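The object-bound finalizer pattern above can be sketched roughly as below. All names here (`DenseHandle`, `push_idle!`, `HANDLE_CACHE`) are illustrative stand-ins, not CUDA.jl's actual definitions.

```julia
mutable struct DenseHandle
    handle::cusolverDnHandle_t       # raw library handle
    workspace_gpu::CuVector{UInt8}   # can grow to hundreds of MiB

    function DenseHandle(handle, workspace)
        obj = new(handle, workspace)
        # Bind the finalizer to the wrapper object, not current_task():
        # once the wrapper is unreferenced, GC can run this even on a
        # long-lived test worker.
        finalizer(obj) do h
            # Return the raw handle to the idle cache; a later
            # HandleCache hook destroys it properly. The workspace
            # CuVector is freed by its own finalizer once unreferenced.
            push_idle!(HANDLE_CACHE, h.handle)
        end
        return obj
    end
end

# The library drops its task-local reference on reclaim, so the wrapper
# becomes garbage and the finalizer can fire:
push!(CUDA.pre_reclaim_hooks, () -> delete!(task_local_storage(), :CUSOLVER_dense))
```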
2. Cap the test-worker memory pool release threshold at 64 MiB
The default `MEMPOOL_ATTR_RELEASE_THRESHOLD` is `typemax(UInt64)` —
freed stream-ordered buffers stay cached indefinitely, inflating NVML
readings. Set a 64 MiB cap on test workers so released buffers go back
to the driver quickly. Trades a per-call pool-alloc cost for peak GPU
RSS tracking the actual live working set.
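Setting the cap might look like the following sketch, assuming CUDA.jl's `memory_pool`/`attribute!` driver wrappers; the exact call site in the PR may differ.

```julia
using CUDA

# Cap the stream-ordered pool's release threshold at 64 MiB so freed
# buffers above that amount are returned to the driver promptly,
# instead of being cached indefinitely (the typemax(UInt64) default).
pool = memory_pool(device())
attribute!(pool, CUDA.MEMPOOL_ATTR_RELEASE_THRESHOLD, UInt64(64 * 2^20))
```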
Impact on library tests: big wins where workspace was hoarded (e.g.
cutensornet/contractions 1508 → 642 MiB), small/neutral elsewhere.
19813 library tests pass cleanly with no errors or warnings.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same pattern as the prior cuSOLVER commit: wrap the raw handle in a mutable struct with an object-bound finalizer, and register a `pre_reclaim_hooks` entry that clears the library's TLS state. When `CUDA.reclaim()` runs, the fat-handle wrapper becomes unreferenced, GC collects it, its finalizer returns the raw handle to the idle cache, and the subsequent HandleCache hook destroys the idle handle properly.

Memory savings for cuBLAS/cuSPARSE are smaller than for cuSOLVER (no cached workspace CuVector, just the handle itself), but this removes the last place library state stays pinned across the worker's lifetime, and makes the lifecycle consistent across libraries.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the two untyped hook arrays (pre_reclaim_hooks, reclaim_hooks) and the three copies of the phase ladder with a single Reclaimable abstract type, a ReclaimLevel enum, and one reclaim(level) entry point. Libraries register their HandleCache and TaskLocalCache instances from __init__ and implement drop!/purge! via dispatch; CUDACore invokes them with per-hook error isolation, modelled on Base.atexit_hooks. retry_reclaim walks ReclaimLevel one rung at a time.

Convert cuDNN, cuRAND, cuTENSOR, cuTensorNet, and cuStateVec from the task-bound finalizer pattern (finalizer(current_task(), …)) to the object-bound fat-handle pattern used by cuBLAS/cuSPARSE/cuSOLVER. The old task-bound variant only released resources once the owning task was garbage-collected, so those libraries previously leaked memory under pressure; they now participate in RECLAIM_DROP_STATE.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
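The unified API described above could look roughly like this. The type, enum, and method names (`Reclaimable`, `ReclaimLevel`, `drop!`, `purge!`, `reclaim(level)`) follow the commit message; the bodies are illustrative sketches, not CUDACore's actual implementation.

```julia
abstract type Reclaimable end

# Rungs of the ladder, cheapest first (ordering per the commits below).
@enum ReclaimLevel RECLAIM_PURGE_IDLE RECLAIM_GC_MINOR RECLAIM_GC_FULL RECLAIM_PURGE_CACHES RECLAIM_DROP_STATE

function drop! end   # libraries implement via dispatch: drop TLS state
function purge! end  # libraries implement via dispatch: destroy idle cached handles

const reclaimables = Reclaimable[]
register!(r::Reclaimable) = push!(reclaimables, r)  # called from each library's __init__

function reclaim(level::ReclaimLevel)
    level in (RECLAIM_GC_MINOR, RECLAIM_GC_FULL) &&
        return GC.gc(level == RECLAIM_GC_FULL)
    for r in reclaimables
        try                       # per-hook error isolation, as in Base.atexit_hooks
            level == RECLAIM_PURGE_IDLE && purge!(r)
            level == RECLAIM_DROP_STATE && drop!(r)
        catch err
            @error "reclaim hook failed" exception=err
        end
    end
end
```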
Inspired by PyTorch's `FreeMemoryCallback` (see
c10/cuda/CUDACachingAllocator.cpp), which invokes registered callbacks
before any expensive allocation work. `RECLAIM_PURGE_IDLE` sits at the
bottom of the ladder and runs `purge!` on every `Reclaimable` without
touching sync, GC, or pool trim. If a library happens to be holding
idle cached handles from dead tasks, we get them back immediately; if
not, it's a cheap no-op and we escalate.
The existing `RECLAIM_PURGE_CACHES` rung stays in place at its former
position: after `RECLAIM_GC_{MINOR,FULL}` have run, handle wrappers
from dead tasks get finalized and their raw handles end up in the
caches. That rung now catches those new entries before we'd have to
escalate all the way to `RECLAIM_DROP_STATE` (which also clears the
current task's TLS).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
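Putting the ladder together, `retry_reclaim` escalates one rung at a time. A hedged sketch, assuming a `ReclaimLevel` enum ordered from cheapest to most aggressive and an allocation closure that returns `nothing` on failure; the real signature may differ.

```julia
function retry_reclaim(f)
    for level in instances(ReclaimLevel)   # cheapest rung first
        ret = f()
        ret === nothing || return ret      # allocation succeeded, stop escalating
        reclaim(level)                     # free more memory, then retry
    end
    return f()                             # final attempt after the top rung
end
```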
Codecov Report

@@            Coverage Diff             @@
##           master    #3118      +/-   ##
==========================================
- Coverage   16.43%   10.17%    -6.27%
==========================================
  Files         123      122        -1
  Lines        9678     9296      -382
==========================================
- Hits         1591      946      -645
- Misses       8087     8350      +263
==========================================