spline: upload invariant point/weight data to GPU once on dask+cupy path #2929
Open
brendancol wants to merge 3 commits into
The dask+cupy backend re-uploaded the input point coordinates and the solved TPS weight vector to the GPU once per chunk. These arrays are the same for every chunk, so a raster split into N tiles copied them across PCIe N times. Add _tps_evaluate_gpu, which takes already-on-device point/weight arrays and uploads only the per-chunk grid slices. _spline_cupy uploads once then delegates; _spline_dask_cupy uploads the invariants once before the per-chunk closure and reuses them. Verified: dask+cupy invariant uploads drop from 3*n_chunks to 3 (16x on a 16-chunk grid); numpy/cupy/dask+cupy parity holds to ~1e-14. Adds a regression test plus cupy and dask+cupy parity tests for spline. scope=spline-only
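The drop from 3*n_chunks to 3 uploads can be illustrated with a small GPU-free sketch. Everything in it is a stand-in: `to_device` is a hypothetical counter used in place of a real host-to-device copy, `evaluate_per_chunk` and `evaluate_hoisted` are illustrative names rather than the functions in `_spline.py`, and the per-chunk grid-slice upload is left out because it is unchanged by the fix.

```python
# Hypothetical stand-ins: no GPU or cupy is used; to_device only counts copies.
uploads = {"count": 0}

def to_device(host_array):
    """Pretend host-to-device copy; records how often it is called."""
    uploads["count"] += 1
    return list(host_array)

def evaluate_per_chunk(x, y, w, chunks):
    """Old pattern: every chunk re-uploads the three invariant arrays."""
    results = []
    for chunk in chunks:
        x_dev, y_dev, w_dev = to_device(x), to_device(y), to_device(w)
        results.append(sum(w_dev) * len(chunk))  # placeholder for the real evaluation
    return results

def evaluate_hoisted(x, y, w, chunks):
    """New pattern: upload the invariants once, reuse them for every chunk."""
    x_dev, y_dev, w_dev = to_device(x), to_device(y), to_device(w)

    def _chunk(chunk):
        # The closure captures the already-uploaded arrays by reference.
        return sum(w_dev) * len(chunk)

    return [_chunk(chunk) for chunk in chunks]

x, y, w = [0.0, 1.0, 2.0], [0.0, 1.0, 0.5], [0.1, 0.2, 0.3]
chunks = [[0] * 4 for _ in range(16)]  # 16 placeholder grid chunks

uploads["count"] = 0
old = evaluate_per_chunk(x, y, w, chunks)
old_uploads = uploads["count"]  # 3 uploads for each of the 16 chunks

uploads["count"] = 0
new = evaluate_hoisted(x, y, w, chunks)
new_uploads = uploads["count"]  # 3 uploads in total
```

On the 16-chunk example the counter goes from 48 to 3 while the per-chunk results stay identical, which is the 16x reduction quoted above.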
brendancol
commented
Jun 4, 2026
Contributor
Author
brendancol
left a comment
PR Review: spline upload invariant point/weight data to GPU once on dask+cupy path
Blockers (must fix before merge)
None.
Suggestions (should fix, not blocking)
- The _chunk closure in _spline_dask_cupy (xrspatial/interpolate/_spline.py:229) now captures the cupy device arrays x_gpu, y_gpu, w_gpu. Under the threaded or synchronous scheduler this is the intended win: the buffers are shared by reference and uploaded once. Under a distributed scheduler the closure is pickled per task, and pickling a cupy array round-trips through host memory, which would reintroduce a per-task transfer. The previous code uploaded inside the worker, so this is not a regression for the common single-GPU path. A one-line comment noting the threaded-scheduler assumption would help the next reader.
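The difference between sharing by reference and serializing per task can be sketched without a GPU. This is only an analogy: a plain Python list stands in for a device buffer, `make_chunk_fn` is a hypothetical helper, and real cupy arrays and dask's serializers are more involved than the standard `pickle` round trip used here.

```python
import pickle
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a device buffer; a plain list so pickle needs no custom class.
w_gpu = [0.1, 0.2, 0.3]

def make_chunk_fn(weights):
    def _chunk(chunk_id):
        # Threaded-scheduler assumption: `weights` is shared by reference,
        # so every task sees the same object and nothing is copied.
        return id(weights)
    return _chunk

chunk_fn = make_chunk_fn(w_gpu)

# Threads in one process: all tasks observe the same captured object.
with ThreadPoolExecutor(max_workers=4) as pool:
    seen_ids = set(pool.map(chunk_fn, range(8)))

# Serializing the captured data once per task, as a multi-process scheduler
# would have to, produces an independent copy each time.
copies = [pickle.loads(pickle.dumps(w_gpu)) for _ in range(8)]
distinct_copies = len({id(c) for c in copies})
```

In the threaded case `seen_ids` contains a single identity, while the serialized case yields one fresh copy per task, which is the per-task transfer the suggestion warns about.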
Nits (optional improvements)
None.
What looks good
- The refactor is behavior-preserving: _tps_evaluate_gpu is the old _spline_cupy body minus the three invariant uploads, which moved to the callers. The CUDA kernel and its bounds guard are untouched.
- Verified numerically: numpy, cupy, and dask+cupy agree to ~1e-14.
- The regression test asserts exactly three invariant uploads regardless of chunk count, and it fails against the pre-fix code (48 != 3), so it actually guards the fix.
- The PR also adds cupy and dask+cupy parity tests for spline, which the suite did not have before.
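An upload-count regression test of the kind described above could be sketched as follows. This is an assumption about the approach, not the test in the PR: `xp` is a hypothetical stand-in for the array module, `spline_chunked` is an illustrative hoisted implementation, and the per-chunk grid uploads are omitted so that only the invariant uploads are counted.

```python
from types import SimpleNamespace
from unittest import mock

# Hypothetical stand-in for the array module; a real test would target cupy.
xp = SimpleNamespace(asarray=lambda a: list(a))

def spline_chunked(x, y, w, n_chunks):
    """Illustrative hoisted implementation: three uploads, then per-chunk work."""
    x_dev, y_dev, w_dev = xp.asarray(x), xp.asarray(y), xp.asarray(w)
    return [sum(w_dev) + i for i in range(n_chunks)]

def count_uploads(n_chunks):
    # wraps= keeps the original behaviour while recording every call.
    with mock.patch.object(xp, "asarray", wraps=xp.asarray) as spy:
        spline_chunked([0.0, 1.0], [0.0, 1.0], [0.5, 0.5], n_chunks)
        return spy.call_count

# The count should not depend on how many chunks the grid is split into.
upload_counts = {n: count_uploads(n) for n in (1, 4, 16)}
```

Asserting that every entry of `upload_counts` equals 3 guards the fix: an implementation that uploads inside the per-chunk function would report 3 * n_chunks instead.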
Checklist
- Algorithm matches reference: no algorithm change, evaluation math identical
- All implemented backends produce consistent results: verified to ~1e-14
- NaN handling is correct: unchanged
- Edge cases are covered by tests: single point, n<3, memory guard already covered
- Dask chunk boundaries handled correctly: grid slices unchanged, per-chunk shape preserved
- No premature materialization or unnecessary copies: this is the fix
- Benchmark exists or is not needed: no benchmark; micro-optimization on an existing path
- README feature matrix updated (if applicable): not applicable, no API change
- Docstrings present and accurate: new helper documented
…review follow-up)
brendancol
commented
Jun 4, 2026
Contributor
Author
brendancol
left a comment
PR Review (follow-up pass)
The earlier suggestion is addressed: _spline_dask_cupy now carries a comment spelling out the threaded/synchronous-scheduler assumption behind the closure capture, and notes that a distributed scheduler is no worse than the previous per-chunk upload.
No remaining blockers, suggestions, or nits. Spline tests stay green (8 passed in TestSpline, including the cupy and dask+cupy parity tests and the upload-count regression).
What
The thin plate spline interpolator's dask+cupy backend re-uploaded the input point coordinates and the solved TPS weight vector to the GPU once per dask chunk. Those arrays are the same for every chunk, so a raster split into N tiles copied them across PCIe N times.
This adds _tps_evaluate_gpu, which takes the point and weight arrays already on the device and uploads only the per-chunk grid slices. _spline_cupy uploads once then delegates; _spline_dask_cupy uploads the invariants once before the per-chunk closure and reuses them.
Backend coverage
Verification
Test plan
- pytest xrspatial/tests/test_interpolation.py (44 passed)
- pytest xrspatial/tests/test_interpolation.py::TestSpline with real cupy on a CUDA host (8 passed)
Part of a deep-sweep performance audit (scope=spline-only). No issue was filed for this finding; it was surfaced directly by the audit.