
Rework memory reclamation API; reduce memory usage #3118

Draft
maleadt wants to merge 8 commits into master from tb/reclaim

Conversation

maleadt (Member) commented Apr 23, 2026

No description provided.

maleadt and others added 7 commits April 23, 2026 16:32
The previous 2 GiB/worker estimate ignored CUDA context and library overhead
plus peak allocations inside individual tests, letting the cap float up
to 64+ workers on large systems. A 4 GiB budget accounts for ~1 GiB
baseline (context + loaded libraries) and leaves room for multi-GiB peak
test allocations.

Also reset the device before querying free memory so the budget reflects
true capacity, and `@info`-log the computed budget (device, free memory,
gpu_jobs, cpu_threads, cpu_free) so users can see why the cap landed
where it did.
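The budgeting above can be sketched in a few lines of plain Julia. This is illustrative only: the 4 GiB figure comes from the commit message, but `worker_cap` and its exact inputs are assumed names, not the PR's actual implementation.

```julia
# Hypothetical sketch of the test-worker budgeting described above.
# 4 GiB per worker: ~1 GiB baseline (context + libraries) plus headroom
# for multi-GiB peak allocations inside individual tests.
const WORKER_BUDGET = 4 * 2^30

function worker_cap(gpu_free::Integer, cpu_threads::Integer)
    gpu_jobs = max(1, gpu_free ÷ WORKER_BUDGET)  # at least one worker
    return min(gpu_jobs, cpu_threads)            # never exceed CPU parallelism
end
```

With the old 2 GiB estimate, a 128 GiB device would have floated up to 64 workers; under this budget the same device caps at 32, before the CPU-side limit is applied.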

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two related changes to reclaim GPU memory that was effectively leaking:

1. Refactor cuSOLVER state to use an object-bound finalizer

   The fat cusolverDnHandle (and analogous sparse/mg wrappers) held a
   `workspace_gpu::CuVector{UInt8}` that can grow to hundreds of MiB. Its
   finalizer was attached to `current_task()`, which on test workers never
   dies — so `CUDA.reclaim()` could empty idle handle caches but couldn't
   touch the pinned workspace. Probing showed ~200 MiB of pool memory per
   cuSOLVER-using worker sitting there permanently.

   Wrap the raw handles in `mutable struct` and attach the finalizer to
   the wrapper instead. Add a `pre_reclaim_hooks` mechanism in CUDACore
   so libraries can drop their TLS state on reclaim; `reclaim()` then
   runs these hooks + GC + regular hooks, so finalizers fire and the
   underlying buffers are released before the pool is trimmed. cuSOLVER
   registers a hook that clears its dense/sparse/mg TLS entries.

2. Cap the test-worker memory pool release threshold at 64 MiB

   The default `MEMPOOL_ATTR_RELEASE_THRESHOLD` is `typemax(UInt64)` —
   freed stream-ordered buffers stay cached indefinitely, inflating NVML
   readings. Set a 64 MiB cap on test workers so released buffers go back
   to the driver quickly. Trades a per-call pool-alloc cost for peak GPU
   RSS tracking the actual live working set.
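The object-bound finalizer pattern from point 1 above can be sketched in pure Julia. The names `FatHandle`, `pre_reclaim_hooks`, and `reclaim_sketch` are stand-ins (the commit's real types live in cuSOLVER/CUDACore), and a plain `Vector{UInt8}` stands in for the GPU workspace:

```julia
# Sketch: the finalizer is attached to the wrapper object, not to
# current_task(), so GC — not task death — controls resource release.
mutable struct FatHandle
    handle::Ptr{Cvoid}        # raw library handle
    workspace::Vector{UInt8}  # stands in for workspace_gpu::CuVector{UInt8}
    function FatHandle(handle, workspace)
        h = new(handle, workspace)
        finalizer(h) do h
            empty!(h.workspace)  # fat state dies with the wrapper
        end
        return h
    end
end

# Libraries register closures that drop their task-local references,
# so wrappers become unreferenced and the next GC finalizes them.
const pre_reclaim_hooks = Function[]
const tls_handles = IdDict{Task,FatHandle}()

function reclaim_sketch()
    foreach(f -> f(), pre_reclaim_hooks)  # drop TLS references first
    GC.gc()                               # finalizers fire, buffers released
end
```

The ordering is the point: dropping TLS references before GC is what lets the finalizers actually run during `reclaim()`, instead of waiting for a worker task that never dies.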

Impact on library tests: big wins where workspace was hoarded (e.g.
cutensornet/contractions 1508 → 642 MiB), small/neutral elsewhere.
19813 library tests pass cleanly with no errors or warnings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same pattern as the prior cuSOLVER commit: wrap the raw handle in a mutable
struct with an object-bound finalizer, and register a `pre_reclaim_hooks`
entry that clears the library's TLS state. When CUDA.reclaim() runs, the
fat-handle wrapper becomes unreferenced, GC collects it, its finalizer
returns the raw handle to the idle cache, and the subsequent HandleCache
hook destroys the idle handle properly.
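The check-in/destroy flow described above can be sketched as a tiny idle cache. `HandleCache`, `check_in!`, and `purge!` are assumed names modelled on the commit text; the real cache is per-context and thread-safe, which this sketch omits:

```julia
# Sketch: finalizers check raw handles back into an idle cache; a later
# reclaim hook destroys whatever sits idle.
mutable struct HandleCache
    idle::Vector{Ptr{Cvoid}}
    destroy::Function          # e.g. the library's *Destroy entry point
end

check_in!(cache::HandleCache, h::Ptr{Cvoid}) = push!(cache.idle, h)

function purge!(cache::HandleCache)
    foreach(cache.destroy, cache.idle)  # destroy each idle handle properly
    empty!(cache.idle)
end
```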

Memory savings for cuBLAS/cuSPARSE are smaller than cuSOLVER (no cached
workspace CuVector; just the handle itself), but this removes the last place
library state stays pinned across the worker's lifetime, and makes the
lifecycle consistent across libraries.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the two untyped hook arrays (pre_reclaim_hooks, reclaim_hooks) and
the three copies of the phase ladder with a single Reclaimable abstract
type, a ReclaimLevel enum, and one reclaim(level) entry point. Libraries
register their HandleCache and TaskLocalCache instances from __init__ and
implement drop!/purge! via dispatch; CUDACore invokes them with per-hook
error isolation, modelled on Base.atexit_hooks. retry_reclaim walks
ReclaimLevel one rung at a time.
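The typed API described above can be sketched as follows. The type, enum, and function names follow the commit message, but the bodies are a minimal pure-Julia illustration, not the PR's code:

```julia
# Sketch: one abstract type, one level enum, one entry point with
# per-hook error isolation (modelled on Base.atexit_hooks).
abstract type Reclaimable end

@enum ReclaimLevel begin
    RECLAIM_PURGE_IDLE
    RECLAIM_GC_MINOR
    RECLAIM_GC_FULL
    RECLAIM_PURGE_CACHES
    RECLAIM_DROP_STATE
end

const reclaimables = Reclaimable[]

# Libraries specialize these via dispatch for their caches.
purge!(::Reclaimable) = nothing
drop!(::Reclaimable) = nothing

function reclaim(level::ReclaimLevel)
    for r in reclaimables
        try
            level >= RECLAIM_DROP_STATE ? drop!(r) : purge!(r)
        catch err
            @error "reclaim hook failed" exception = err  # isolate per hook
        end
    end
end
```

Error isolation matters here for the same reason it does in `Base.atexit_hooks`: one library's broken hook must not prevent the others from releasing memory.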

Convert cuDNN, cuRAND, cuTENSOR, cuTensorNet, and cuStateVec from the
task-bound finalizer pattern (finalizer(current_task(), …)) to the
object-bound fat-handle pattern used by cuBLAS/cuSPARSE/cuSOLVER. The old
task-bound variant only released resources once the owning task was
garbage-collected, so those libraries previously leaked memory under
pressure; they now participate in RECLAIM_DROP_STATE.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Inspired by PyTorch's `FreeMemoryCallback` (see
c10/cuda/CUDACachingAllocator.cpp), which invokes registered callbacks
before any expensive allocation work. `RECLAIM_PURGE_IDLE` sits at the
bottom of the ladder and runs `purge!` on every `Reclaimable` without
touching sync, GC, or pool trim. If a library happens to be holding
idle cached handles from dead tasks, we get them back immediately; if
not, it's a cheap no-op and we escalate.

The existing `RECLAIM_PURGE_CACHES` rung stays in place at its former
position: after `RECLAIM_GC_{MINOR,FULL}` have run, handle wrappers
from dead tasks get finalized and their raw handles end up in the
caches. That rung now catches those new entries before we'd have to
escalate all the way to `RECLAIM_DROP_STATE` (which also clears the
current task's TLS).
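The escalation behaviour described above (cheap rung first, escalate only on failure) can be sketched generically. `retry_reclaim` is the name from the commit text; representing the rungs as a vector of closures is an assumption made to keep the sketch self-contained:

```julia
# Sketch: walk the reclaim ladder one rung at a time, retrying the
# failing operation after each rung and stopping as soon as it succeeds.
function retry_reclaim(f, rungs::Vector{<:Function})
    for reclaim! in rungs
        f() && return true   # succeeded; no need to escalate further
        reclaim!()           # run the cheapest remaining rung
    end
    return f()               # last attempt after the most expensive rung
end
```

Under this shape, `RECLAIM_PURGE_IDLE` being a cheap no-op when no idle handles exist is exactly what makes it safe to put at the bottom: a failed attempt costs almost nothing before escalation.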

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@maleadt maleadt changed the title Rework memory reclamation API Rework memory reclamation API; reduce memory usage Apr 23, 2026
codecov bot commented Apr 23, 2026

Codecov Report

❌ Patch coverage is 33.03571% with 75 lines in your changes missing coverage. Please review.
✅ Project coverage is 10.17%. Comparing base (22b2689) to head (9d8584e).
⚠️ Report is 1 commit behind head on master.

| Files with missing lines           | Patch % | Lines         |
|------------------------------------|---------|---------------|
| lib/cusolver/src/cuSOLVER.jl       | 17.24%  | 24 Missing ⚠️ |
| lib/cudnn/src/cuDNN.jl             | 0.00%   | 9 Missing ⚠️  |
| lib/custatevec/src/cuStateVec.jl   | 0.00%   | 9 Missing ⚠️  |
| lib/cutensor/src/cuTENSOR.jl       | 0.00%   | 9 Missing ⚠️  |
| lib/cutensornet/src/cuTensorNet.jl | 0.00%   | 9 Missing ⚠️  |
| lib/cusparse/src/cuSPARSE.jl       | 20.00%  | 8 Missing ⚠️  |
| lib/cublas/src/cuBLAS.jl           | 63.15%  | 7 Missing ⚠️  |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3118      +/-   ##
==========================================
- Coverage   16.43%   10.17%   -6.27%     
==========================================
  Files         123      122       -1     
  Lines        9678     9296     -382     
==========================================
- Hits         1591      946     -645     
- Misses       8087     8350     +263     

☔ View full report in Codecov by Sentry.
