Refactored CUDA parallel scan kernel (429bd511) · Commits · TNL / tnl-dev

Commit 429bd511 authored Jul 19, 2021 by

Jakub Klinkovský

Refactored CUDA parallel scan kernel

Using an odd number of valuesPerThread avoids shared memory bank
conflicts even without a special interleaving. We also save some shared
memory this way.
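
The bank-conflict argument can be checked on the host with a small model. The sketch below is not code from this commit. It assumes 32 shared memory banks of 4-byte words, a warp of 32 threads, and a layout where thread t accesses the word at index t * valuesPerThread + offset. The helper name maxBankConflict is hypothetical.

```cpp
#include <algorithm>
#include <array>

// Assumed hardware parameters (typical for current NVIDIA GPUs).
constexpr int warpSize = 32;
constexpr int numBanks = 32;

// Hypothetical helper: returns the worst-case number of threads of one warp
// that hit the same shared memory bank when thread t accesses the 4-byte
// word at index t * valuesPerThread + offset.
// A result of 1 means the access is conflict-free.
int maxBankConflict(int valuesPerThread)
{
    int worst = 0;
    for (int offset = 0; offset < valuesPerThread; ++offset) {
        std::array<int, numBanks> hits{};  // zero-initialized hit counters
        for (int t = 0; t < warpSize; ++t) {
            const int bank = (t * valuesPerThread + offset) % numBanks;
            ++hits[bank];
        }
        worst = std::max(worst, *std::max_element(hits.begin(), hits.end()));
    }
    return worst;
}
```

Under these assumptions an odd stride such as 7 is coprime with 32, so the 32 threads map to 32 distinct banks, whereas a stride of 8 sends 8 threads to each of only 4 banks.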

Small inputs can be scanned with just one CUDA block, which avoids the
scan of block results and the second-phase kernel. Hence, large arrays
can be scanned with just 3 kernel launches instead of 4.
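
The launch count can be illustrated with a small host-side model. This is a sketch under assumptions, not the code from this commit: the function name countLaunches, the block size of 256 threads, and valuesPerThread = 7 are hypothetical. It assumes that an input fitting into one block takes one launch, and that a larger input takes a first-phase launch, a recursive scan of the per-block results, and a second-phase launch.

```cpp
#include <cstddef>

// Hypothetical launch configuration, chosen only for illustration.
constexpr std::size_t blockSize = 256;
constexpr std::size_t valuesPerThread = 7;
constexpr std::size_t elementsPerBlock = blockSize * valuesPerThread;  // 1792

// Counts kernel launches in the assumed scheme.
std::size_t countLaunches(std::size_t n)
{
    // Small input: a single block scans everything in one launch.
    if (n <= elementsPerBlock)
        return 1;
    // Large input: one partial result per block (rounded up).
    const std::size_t blocks = (n + elementsPerBlock - 1) / elementsPerBlock;
    // First phase + scan of the block results + second phase.
    return 1 + countLaunches(blocks) + 1;
}
```

In this model, an array of one million elements yields 559 block results, which fit into a single block, so the total is 3 launches. Only inputs with more block results than one block can hold would need further levels.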

parent c37987b1
