Introducing @reduce for group level reduction #379
brabreda wants to merge 6 commits into JuliaGPU:release-0.8 from brabreda:release-0.8
Conversation
```julia
threadIdx = KernelAbstractions.@index(Local)

# shared mem for a complete reduction
shared = KernelAbstractions.@localmem(T, 1024)
```
Maybe this is the moment we need dynamic shared memory support?
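For reference, a minimal sketch of what that could look like with CUDA.jl's existing dynamic shared memory support (`CuDynamicSharedArray`, sized at launch time via the `shmem` keyword). The kernel body and names are illustrative only, not part of this PR:

```julia
using CUDA

# Illustrative kernel: the shared-memory size is fixed at launch time
# rather than compile time, so one kernel serves any workgroup size.
function partial_reduce!(op, out, a)
    tid = threadIdx().x
    n = blockDim().x
    # dynamic shared memory, sized by the `shmem` launch keyword
    shared = CuDynamicSharedArray(eltype(a), n)
    shared[tid] = a[(blockIdx().x - 1) * n + tid]
    sync_threads()
    # ... tree reduction over `shared` ...
    return nothing
end

# launch with enough dynamic shared memory for one element per thread
# @cuda threads=512 blocks=8 shmem=512*sizeof(Float32) partial_reduce!(+, out, a)
```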
```julia
# perform the reduction
d = 1
while d < threads
    KernelAbstractions.@synchronize()
```
You are inside CUDAKernels here and as such you can use CUDA.jl functionality directly.
That's correct! But an implementation with KA.jl macros would allow for a single implementation that can run on all supported back-ends. Because of this I am not sure what the best place is for the code of this implementation.
Also, the main difference between back-ends would be the size of local memory, but the use of dynamic memory would be a solution to this.
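To make the discussion concrete, here is a minimal sketch of the full tree reduction written only with KA.jl macros, so it stays back-end agnostic. The kernel name, the fixed 1024-element local memory, and the launch snippet below are assumptions for illustration, not the code in this PR:

```julia
using KernelAbstractions

@kernel function groupreduce!(out, @Const(a), op)
    threadIdx = @index(Local)
    threads = @groupsize()[1]

    # shared mem for a complete reduction (statically sized for now)
    shared = @localmem eltype(a) 1024
    shared[threadIdx] = a[@index(Global)]

    # tree reduction: each step combines pairs of values `d` apart,
    # doubling `d` until only shared[1] remains
    d = 1
    while d < threads
        @synchronize()
        index = 2 * d * (threadIdx - 1) + 1
        if index + d <= threads
            shared[index] = op(shared[index], shared[index + d])
        end
        d *= 2
    end

    # the first thread writes this group's partial result
    if threadIdx == 1
        out[@index(Group)] = shared[1]
    end
end
```

With the event-based launch API of KA 0.8 this could run on the CUDA back-end as, e.g.:

```julia
using CUDA, CUDAKernels

a = CUDA.rand(Float32, 4096)
out = CUDA.zeros(Float32, 4)            # one partial result per group
kernel = groupreduce!(CUDADevice(), 1024)
wait(kernel(out, a, +; ndrange = length(a)))
```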
Looks like a great start! Will have to add it to …
To make a more generalized @reduce operation, I would work with a Config struct. An example of this can be found in the GemmKernels.jl Config. Based on this struct, the reduction could use atomics and lane/warp reductions.
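As a strawman, such a config could look something like the following; every field name here is hypothetical, loosely inspired by the Config idea in GemmKernels.jl, and not an existing API:

```julia
# Hypothetical knobs for a generalized group reduction.
Base.@kwdef struct ReduceConfig
    groupsize::Int = 1024           # threads per group
    use_atomics::Bool = false       # combine partial results with atomics
    use_warp_shuffle::Bool = false  # lane/warp-level reduction where supported
end
```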
The @reduce macro performs a group level reduction.
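A rough usage sketch (the exact signature @reduce ends up with may differ; this assumes it takes an operator and a per-thread value and returns the group-wide result):

```julia
@kernel function block_sum!(out, @Const(a))
    val = a[@index(Global)]
    total = KernelAbstractions.@reduce(+, val)  # group-level reduction
    if @index(Local) == 1
        out[@index(Group)] = total
    end
end
```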
TODOs: