Refactored CUDA parallel scan kernel (c37987b1) · Commits · TNL / tnl-dev

Commit c37987b1 authored Jul 18, 2021 by Jakub Klinkovský

Refactored CUDA parallel scan kernel

The input values are first copied into shared memory, reduced
sequentially across chunks, and scanned only at the end of the kernel.
This follows the upsweep-downsweep approach by Blelloch, which is more
work-efficient. Also, the distinction between exclusive and inclusive
scan now appears only at the end of the kernel, which avoids the odd
"+2" in the size of the shared memory.
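
As a rough illustration of the three phases described above, the following is a sequential host-side model in plain C++, not the CUDA kernel from this commit. The names `chunkedScan` and `ScanType` are hypothetical, and addition is assumed as the reduction operation. In the real kernel the chunks would be handled by parallel threads, and the middle phase would be the parallel upsweep-downsweep scan.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical names for illustration only; this is not TNL's API.
enum class ScanType { Exclusive, Inclusive };

// Sequential model of a chunked scan (chunkSize must be positive).
std::vector<int> chunkedScan(const std::vector<int>& input,
                             std::size_t chunkSize,
                             ScanType type)
{
    const std::size_t n = input.size();
    const std::size_t chunks = (n + chunkSize - 1) / chunkSize;

    // Phase 1: reduce each chunk sequentially to a single total.
    std::vector<int> chunkResults(chunks, 0);
    for (std::size_t c = 0; c < chunks; c++) {
        const std::size_t end = std::min(n, (c + 1) * chunkSize);
        for (std::size_t i = c * chunkSize; i < end; i++)
            chunkResults[c] += input[i];
    }

    // Phase 2 (spine scan): an exclusive scan of the chunk totals turns
    // each total into the starting offset of its chunk.
    int running = 0;
    for (std::size_t c = 0; c < chunks; c++) {
        const int total = chunkResults[c];
        chunkResults[c] = running;
        running += total;
    }

    // Phase 3: scan inside each chunk, starting from its offset.
    // The exclusive/inclusive distinction appears only here.
    std::vector<int> output(n);
    for (std::size_t c = 0; c < chunks; c++) {
        int acc = chunkResults[c];
        const std::size_t end = std::min(n, (c + 1) * chunkSize);
        for (std::size_t i = c * chunkSize; i < end; i++) {
            if (type == ScanType::Exclusive) {
                output[i] = acc;   // write the sum of all previous values
                acc += input[i];
            } else {
                acc += input[i];
                output[i] = acc;   // write the sum including this value
            }
        }
    }
    return output;
}
```

For example, calling `chunkedScan({1, 2, 3, 4, 5, 6, 7}, 3, ScanType::Inclusive)` yields `1 3 6 10 15 21 28`, and the exclusive variant yields `0 1 3 6 10 15 21`. Both variants share phases 1 and 2 unchanged.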

Also used Cuda::getInterleaving() for the indices when accessing the
chunkResults array, which avoids shared memory bank conflicts in the
spine-scan phase.
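
To show why interleaved indices can help, here is a small host-side C++ model. It assumes 32 shared memory banks, one array element per bank slot, and an interleaving rule of the form `index + index / 32`. That rule is an assumption about what `Cuda::getInterleaving()` computes and has not been checked against the TNL source. The helper names `interleave`, `bankOf` and `distinctBanks` are hypothetical.

```cpp
#include <array>
#include <cstddef>

// Assumed number of shared memory banks (hypothetical constant name).
constexpr std::size_t kBanks = 32;

// Assumed interleaving rule: skip one slot after every kBanks elements.
// An array indexed this way needs interleave(n - 1) + 1 slots for n elements.
constexpr std::size_t interleave(std::size_t index)
{
    return index + index / kBanks;
}

// Bank that a given slot falls into, assuming one element per bank slot.
constexpr std::size_t bankOf(std::size_t slot)
{
    return slot % kBanks;
}

// Count how many distinct banks are hit when kBanks threads access
// elements t * stride for t = 0 .. kBanks - 1. Fewer distinct banks
// means more threads collide on the same bank.
std::size_t distinctBanks(std::size_t stride, bool useInterleaving)
{
    std::array<bool, kBanks> hit{};
    std::size_t count = 0;
    for (std::size_t t = 0; t < kBanks; t++) {
        const std::size_t index = t * stride;
        const std::size_t slot = useInterleaving ? interleave(index) : index;
        if (!hit[bankOf(slot)]) {
            hit[bankOf(slot)] = true;
            count++;
        }
    }
    return count;
}
```

With a stride of 32, plain indexing sends all 32 accesses to a single bank, while the interleaved indices in this model land in 32 different banks. Strided access of this kind is typical of tree-based scans, which is presumably why the spine-scan phase benefits.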

parent 7a688833
