
Introducing @reduce for group level reduction#379

Closed
brabreda wants to merge 6 commits into JuliaGPU:release-0.8 from brabreda:release-0.8

Conversation

@brabreda

@brabreda brabreda commented Apr 5, 2023

The @reduce macro performs a group level reduction.

TODOs:

  • Figure out a place for the implementation.
  • Add a lane level reduction.
  • Create a more advanced group level reduction that can utilize platform-dependent features such as lane reductions and atomics.
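A minimal usage sketch of the proposed macro, assuming `@reduce` takes an operator and a per-item value and returns the group-wide result (the exact signature and the `groupsum!` kernel name are assumptions, not the final API):

```julia
using KernelAbstractions

# Hypothetical kernel: each workgroup sums its slice of `a` into `out[group]`.
@kernel function groupsum!(out, @Const(a))
    I = @index(Global)
    g = @index(Group)
    # @reduce combines `a[I]` across the workgroup with `+`;
    # every lane receives the reduced value.
    val = @reduce(+, a[I])
    if @index(Local) == 1
        out[g] = val
    end
end
```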

Comment thread lib/CUDAKernels/src/CUDAKernels.jl Outdated
threadIdx = KernelAbstractions.@index(Local)

# shared mem for a complete reduction
shared = KernelAbstractions.@localmem(T, 1024)
Member


Maybe this is the moment we need dynamic shared memory support?

Member


x-ref: #11

Comment thread src/KernelAbstractions.jl
Comment thread lib/CUDAKernels/src/CUDAKernels.jl Outdated
# perform the reduction
d = 1
while d < threads
KernelAbstractions.@synchronize()
Member


You are inside CUDAKernels here and as such you can use CUDA.jl functionality directly.

Author


That's correct! But an implementation with KA.jl macros would allow for a single implementation that can run on all supported back-ends. Because of this, I am not sure what the best place is for this implementation's code.

Also, the main difference between back-ends would be the size of local memory, but the use of dynamic memory would be a solution to this.
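The backend-agnostic approach discussed here can be sketched with KA.jl macros alone, in the same style as the quoted hunks above. This is only a sketch under assumptions: the fixed local-memory size (1024) mirrors the snippet in this thread, and passing `op` as a kernel argument is an illustrative choice, not the PR's design.

```julia
using KernelAbstractions

# Sketch of a group-level tree reduction written only with KA.jl macros,
# so the same code lowers to every supported back-end.
@kernel function reduce_group!(out, @Const(a), op)
    tid = @index(Local)
    gid = @index(Group)
    threads = @groupsize()[1]

    # shared mem for a complete reduction; a fixed upper bound until
    # dynamic shared memory support lands (x-ref: #11)
    shared = @localmem eltype(a) 1024
    shared[tid] = a[@index(Global)]

    # perform the reduction: double the stride each step,
    # halving the number of active lanes
    d = 1
    while d < threads
        @synchronize()
        index = 2 * d * (tid - 1) + 1
        if index + d <= threads
            shared[index] = op(shared[index], shared[index + d])
        end
        d *= 2
    end

    # lane 1 holds the group's result
    if tid == 1
        out[gid] = shared[1]
    end
end
```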

@vchuravy
Member

vchuravy commented Apr 6, 2023

Looks like a great start! Will have to add it to 0.9 but that can happen after you are happy with the initial implementation.

@brabreda
Author

brabreda commented Apr 6, 2023

To make a more generalized @reduce operation, I would work with a Config struct. An example of this can be found in the GemmKernels.jl Config.

Based on this struct, the reduction could use atomics and lane/warp reductions.
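Such a struct might look like the following; every field name here is a hypothetical placeholder (only the idea of a compile-time config, modeled on GemmKernels.jl's `Config`, comes from the discussion):

```julia
# Hypothetical configuration for the generalized @reduce; all field
# names and defaults are assumptions for illustration.
Base.@kwdef struct ReduceConfig
    use_atomics::Bool     = false  # combine partial results with atomics
    use_warp_reduce::Bool = false  # use lane/warp-level reductions
    groupsize::Int        = 256    # threads per workgroup
end
```

Dispatching on such a struct would let back-ends that support warp shuffles or device-wide atomics pick a faster path, while others fall back to the plain shared-memory tree reduction.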

@brabreda brabreda closed this by deleting the head repository Apr 11, 2023

2 participants