
Rework memory reclamation API; reduce memory usage #3118

Draft
maleadt wants to merge 8 commits into master from tb/reclaim

Conversation

maleadt (Member) commented Apr 23, 2026

No description provided.

maleadt and others added 7 commits April 23, 2026 16:32
The previous 2 GiB/worker estimate ignored CUDA context and library overhead
plus peak allocations inside individual tests, letting the cap float up
to 64+ workers on large systems. A 4 GiB budget accounts for ~1 GiB
baseline (context + loaded libraries) and leaves room for multi-GiB peak
test allocations.

Also reset the device before querying free memory so the budget reflects
true capacity, and `@info`-log the computed budget (device, free memory,
gpu_jobs, cpu_threads, cpu_free) so users can see why the cap landed
where it did.
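The budgeting above can be sketched in a few lines of plain Julia. This is illustrative only: the 4 GiB figure comes from the commit message, but `worker_cap` and its exact inputs are assumed names, not the PR's actual implementation.

```julia
# Hypothetical sketch of the test-worker budgeting described above.
# 4 GiB per worker: ~1 GiB baseline (context + libraries) plus headroom
# for multi-GiB peak allocations inside individual tests.
const WORKER_BUDGET = 4 * 2^30

function worker_cap(gpu_free::Integer, cpu_threads::Integer)
    gpu_jobs = max(1, gpu_free ÷ WORKER_BUDGET)  # at least one worker
    return min(gpu_jobs, cpu_threads)            # never exceed CPU parallelism
end
```

With the old 2 GiB estimate, a 128 GiB device would have floated up to 64 workers; under this budget the same device caps at 32, before the CPU-side limit is applied.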

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two related changes to reclaim GPU memory that was effectively leaking:

1. Refactor cuSOLVER state to use an object-bound finalizer

   The fat cusolverDnHandle (and analogous sparse/mg wrappers) held a
   `workspace_gpu::CuVector{UInt8}` that can grow to hundreds of MiB. Its
   finalizer was attached to `current_task()`, which on test workers never
   dies — so `CUDA.reclaim()` could empty idle handle caches but couldn't
   touch the pinned workspace. Probing showed ~200 MiB of pool memory per
   cuSOLVER-using worker sitting there permanently.

   Wrap the raw handles in `mutable struct` and attach the finalizer to
   the wrapper instead. Add a `pre_reclaim_hooks` mechanism in CUDACore
   so libraries can drop their TLS state on reclaim; `reclaim()` then
   runs these hooks + GC + regular hooks, so finalizers fire and the
   underlying buffers are released before the pool is trimmed. cuSOLVER
   registers a hook that clears its dense/sparse/mg TLS entries.

2. Cap the test-worker memory pool release threshold at 64 MiB

   The default `MEMPOOL_ATTR_RELEASE_THRESHOLD` is `typemax(UInt64)` —
   freed stream-ordered buffers stay cached indefinitely, inflating NVML
   readings. Set a 64 MiB cap on test workers so released buffers go back
   to the driver quickly. Trades a per-call pool-alloc cost for peak GPU
   RSS tracking the actual live working set.
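The object-bound finalizer pattern from point 1 above can be sketched in pure Julia. The names `FatHandle`, `pre_reclaim_hooks`, and `reclaim_sketch` are stand-ins (the commit's real types live in cuSOLVER/CUDACore), and a plain `Vector{UInt8}` stands in for the GPU workspace:

```julia
# Sketch: the finalizer is attached to the wrapper object, not to
# current_task(), so GC — not task death — controls resource release.
mutable struct FatHandle
    handle::Ptr{Cvoid}        # raw library handle
    workspace::Vector{UInt8}  # stands in for workspace_gpu::CuVector{UInt8}
    function FatHandle(handle, workspace)
        h = new(handle, workspace)
        finalizer(h) do h
            empty!(h.workspace)  # fat state dies with the wrapper
        end
        return h
    end
end

# Libraries register closures that drop their task-local references,
# so wrappers become unreferenced and the next GC finalizes them.
const pre_reclaim_hooks = Function[]
const tls_handles = IdDict{Task,FatHandle}()

function reclaim_sketch()
    foreach(f -> f(), pre_reclaim_hooks)  # drop TLS references first
    GC.gc()                               # finalizers fire, buffers released
end
```

The ordering is the point: dropping TLS references before GC is what lets the finalizers actually run during `reclaim()`, instead of waiting for a worker task that never dies.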

Impact on library tests: big wins where workspace was hoarded (e.g.
cutensornet/contractions 1508 → 642 MiB), small/neutral elsewhere.
19813 library tests pass cleanly with no errors or warnings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same pattern as the prior cuSOLVER commit: wrap the raw handle in a mutable
struct with an object-bound finalizer, and register a `pre_reclaim_hooks`
entry that clears the library's TLS state. When CUDA.reclaim() runs, the
fat-handle wrapper becomes unreferenced, GC collects it, its finalizer
returns the raw handle to the idle cache, and the subsequent HandleCache
hook destroys the idle handle properly.
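The check-in/destroy flow described above can be sketched as a tiny idle cache. `HandleCache`, `check_in!`, and `purge!` are assumed names modelled on the commit text; the real cache is per-context and thread-safe, which this sketch omits:

```julia
# Sketch: finalizers check raw handles back into an idle cache; a later
# reclaim hook destroys whatever sits idle.
mutable struct HandleCache
    idle::Vector{Ptr{Cvoid}}
    destroy::Function          # e.g. the library's *Destroy entry point
end

check_in!(cache::HandleCache, h::Ptr{Cvoid}) = push!(cache.idle, h)

function purge!(cache::HandleCache)
    foreach(cache.destroy, cache.idle)  # destroy each idle handle properly
    empty!(cache.idle)
end
```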

Memory savings for cuBLAS/cuSPARSE are smaller than cuSOLVER (no cached
workspace CuVector; just the handle itself), but this removes the last place
library state stays pinned across the worker's lifetime, and makes the
lifecycle consistent across libraries.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the two untyped hook arrays (pre_reclaim_hooks, reclaim_hooks) and
the three copies of the phase ladder with a single Reclaimable abstract
type, a ReclaimLevel enum, and one reclaim(level) entry point. Libraries
register their HandleCache and TaskLocalCache instances from __init__ and
implement drop!/purge! via dispatch; CUDACore invokes them with per-hook
error isolation, modelled on Base.atexit_hooks. retry_reclaim walks
ReclaimLevel one rung at a time.
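The typed API described above can be sketched as follows. The type, enum, and function names follow the commit message, but the bodies are a minimal pure-Julia illustration, not the PR's code:

```julia
# Sketch: one abstract type, one level enum, one entry point with
# per-hook error isolation (modelled on Base.atexit_hooks).
abstract type Reclaimable end

@enum ReclaimLevel begin
    RECLAIM_PURGE_IDLE
    RECLAIM_GC_MINOR
    RECLAIM_GC_FULL
    RECLAIM_PURGE_CACHES
    RECLAIM_DROP_STATE
end

const reclaimables = Reclaimable[]

# Libraries specialize these via dispatch for their caches.
purge!(::Reclaimable) = nothing
drop!(::Reclaimable) = nothing

function reclaim(level::ReclaimLevel)
    for r in reclaimables
        try
            level >= RECLAIM_DROP_STATE ? drop!(r) : purge!(r)
        catch err
            @error "reclaim hook failed" exception = err  # isolate per hook
        end
    end
end
```

Error isolation matters here for the same reason it does in `Base.atexit_hooks`: one library's broken hook must not prevent the others from releasing memory.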

Convert cuDNN, cuRAND, cuTENSOR, cuTensorNet, and cuStateVec from the
task-bound finalizer pattern (finalizer(current_task(), …)) to the
object-bound fat-handle pattern used by cuBLAS/cuSPARSE/cuSOLVER. The old
task-bound variant only released resources once the owning task was
garbage-collected, so those libraries previously leaked memory under
pressure; they now participate in RECLAIM_DROP_STATE.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Inspired by PyTorch's `FreeMemoryCallback` (see
c10/cuda/CUDACachingAllocator.cpp), which invokes registered callbacks
before any expensive allocation work. `RECLAIM_PURGE_IDLE` sits at the
bottom of the ladder and runs `purge!` on every `Reclaimable` without
touching sync, GC, or pool trim. If a library happens to be holding
idle cached handles from dead tasks, we get them back immediately; if
not, it's a cheap no-op and we escalate.

The existing `RECLAIM_PURGE_CACHES` rung stays in place at its former
position: after `RECLAIM_GC_{MINOR,FULL}` have run, handle wrappers
from dead tasks get finalized and their raw handles end up in the
caches. That rung now catches those new entries before we'd have to
escalate all the way to `RECLAIM_DROP_STATE` (which also clears the
current task's TLS).
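The escalation behaviour described above (cheap rung first, escalate only on failure) can be sketched generically. `retry_reclaim` is the name from the commit text; representing the rungs as a vector of closures is an assumption made to keep the sketch self-contained:

```julia
# Sketch: walk the reclaim ladder one rung at a time, retrying the
# failing operation after each rung and stopping as soon as it succeeds.
function retry_reclaim(f, rungs::Vector{<:Function})
    for reclaim! in rungs
        f() && return true   # succeeded; no need to escalate further
        reclaim!()           # run the cheapest remaining rung
    end
    return f()               # last attempt after the most expensive rung
end
```

Under this shape, `RECLAIM_PURGE_IDLE` being a cheap no-op when no idle handles exist is exactly what makes it safe to put at the bottom: a failed attempt costs almost nothing before escalation.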

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@maleadt maleadt changed the title Rework memory reclamation API Rework memory reclamation API; reduce memory usage Apr 23, 2026
codecov bot commented Apr 23, 2026

Codecov Report

❌ Patch coverage is 33.03571% with 75 lines in your changes missing coverage. Please review.
✅ Project coverage is 10.17%. Comparing base (22b2689) to head (9d8584e).
⚠️ Report is 1 commit behind head on master.

| Files with missing lines           | Patch % | Lines         |
|------------------------------------|---------|---------------|
| lib/cusolver/src/cuSOLVER.jl       | 17.24%  | 24 Missing ⚠️ |
| lib/cudnn/src/cuDNN.jl             | 0.00%   | 9 Missing ⚠️  |
| lib/custatevec/src/cuStateVec.jl   | 0.00%   | 9 Missing ⚠️  |
| lib/cutensor/src/cuTENSOR.jl       | 0.00%   | 9 Missing ⚠️  |
| lib/cutensornet/src/cuTensorNet.jl | 0.00%   | 9 Missing ⚠️  |
| lib/cusparse/src/cuSPARSE.jl       | 20.00%  | 8 Missing ⚠️  |
| lib/cublas/src/cuBLAS.jl           | 63.15%  | 7 Missing ⚠️  |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3118      +/-   ##
==========================================
- Coverage   16.43%   10.17%   -6.27%     
==========================================
  Files         123      122       -1     
  Lines        9678     9296     -382     
==========================================
- Hits         1591      946     -645     
- Misses       8087     8350     +263     

☔ View full report in Codecov by Sentry.
