Skip to content

spline: upload invariant point/weight data to GPU once on dask+cupy path#2929

Open
brendancol wants to merge 3 commits into
xarray-contrib:mainfrom
brendancol:deep-sweep-performance-interpolate_spline-2026-06-04
Open

spline: upload invariant point/weight data to GPU once on dask+cupy path#2929
brendancol wants to merge 3 commits into
xarray-contrib:mainfrom
brendancol:deep-sweep-performance-interpolate_spline-2026-06-04

Conversation

@brendancol
Copy link
Copy Markdown
Contributor

What

The thin plate spline interpolator's dask+cupy backend re-uploaded the input point coordinates and the solved TPS weight vector to the GPU once per dask chunk. Those arrays are the same for every chunk, so a raster split into N tiles copied them across PCIe N times.

This adds _tps_evaluate_gpu, which takes the point and weight arrays already on the device and uploads only the per-chunk grid slices. _spline_cupy uploads once then delegates; _spline_dask_cupy uploads the invariants once before the per-chunk closure and reuses them.

Backend coverage

  • numpy: unchanged
  • cupy (single array): unchanged behaviour, now routes through the shared GPU evaluator
  • dask+numpy: unchanged
  • dask+cupy: invariant uploads drop from 3 per chunk to 3 total

Verification

  • dask+cupy invariant point/weight uploads on a 16-chunk grid: 48 before, 3 after
  • numpy / cupy / dask+cupy parity holds to ~1e-14
  • New regression test asserts the dask+cupy path uploads the point/weight arrays exactly three times regardless of chunk count; it fails against the old code (48 != 3) and passes with the fix
  • Added cupy and dask+cupy parity tests for spline (the suite previously had neither)

Test plan

  • pytest xrspatial/tests/test_interpolation.py (44 passed)
  • pytest xrspatial/tests/test_interpolation.py::TestSpline with real cupy on a CUDA host (8 passed)
  • Regression test confirmed red without the source fix

Part of a deep-sweep performance audit (scope=spline-only). No issue was filed for this finding; it was surfaced directly by the audit.

The dask+cupy backend re-uploaded the input point coordinates and the
solved TPS weight vector to the GPU once per chunk. These arrays are the
same for every chunk, so a raster split into N tiles copied them across
PCIe N times.

Add _tps_evaluate_gpu, which takes already-on-device point/weight arrays
and uploads only the per-chunk grid slices. _spline_cupy uploads once
then delegates; _spline_dask_cupy uploads the invariants once before the
per-chunk closure and reuses them.

Verified: dask+cupy invariant uploads drop from 3*n_chunks to 3 (16x on
a 16-chunk grid); numpy/cupy/dask+cupy parity holds to ~1e-14. Adds a
regression test plus cupy and dask+cupy parity tests for spline.

scope=spline-only
@github-actions github-actions Bot added the performance PR touches performance-sensitive code label Jun 4, 2026
Copy link
Copy Markdown
Contributor Author

@brendancol brendancol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review: spline upload invariant point/weight data to GPU once on dask+cupy path

Blockers (must fix before merge)

None.

Suggestions (should fix, not blocking)

  • The _chunk closure in _spline_dask_cupy (xrspatial/interpolate/_spline.py:229) now captures the cupy device arrays x_gpu, y_gpu, w_gpu. Under the threaded or synchronous scheduler this is the intended win: the buffers are shared by reference and uploaded once. Under a distributed scheduler the closure is pickled per task, and pickling a cupy array round-trips through host memory, which would reintroduce a per-task transfer. The previous code uploaded inside the worker, so this is not a regression for the common single-GPU path. A one-line comment noting the threaded-scheduler assumption would help the next reader.

Nits (optional improvements)

None.

What looks good

  • The refactor is behavior-preserving: _tps_evaluate_gpu is the old _spline_cupy body minus the three invariant uploads, which moved to the callers. The CUDA kernel and its bounds guard are untouched.
  • Verified numerically: numpy, cupy, and dask+cupy agree to ~1e-14.
  • The regression test asserts exactly three invariant uploads regardless of chunk count, and it fails against the pre-fix code (48 != 3), so it actually guards the fix.
  • The PR also adds cupy and dask+cupy parity tests for spline, which the suite did not have before.

Checklist

  • Algorithm matches reference: no algorithm change, evaluation math identical
  • All implemented backends produce consistent results: verified to ~1e-14
  • NaN handling is correct: unchanged
  • Edge cases are covered by tests: single point, n<3, memory guard already covered
  • Dask chunk boundaries handled correctly: grid slices unchanged, per-chunk shape preserved
  • No premature materialization or unnecessary copies: this is the fix
  • Benchmark exists or is not needed: no benchmark; micro-optimization on an existing path
  • README feature matrix updated (if applicable): not applicable, no API change
  • Docstrings present and accurate: new helper documented

Copy link
Copy Markdown
Contributor Author

@brendancol brendancol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review (follow-up pass)

The earlier suggestion is addressed: _spline_dask_cupy now carries a comment spelling out the threaded/synchronous-scheduler assumption behind the closure capture, and notes that a distributed scheduler is no worse than the previous per-chunk upload.

No remaining blockers, suggestions, or nits. Spline tests stay green (8 passed in TestSpline, including the cupy and dask+cupy parity tests and the upload-count regression).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance PR touches performance-sensitive code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant