Refactored CUDA parallel scan kernel
The input values are first copied into shared memory, reduced sequentially across chunks, and scanned only at the end of the kernel. This follows Blelloch's upsweep-downsweep approach, which is more work-efficient. The distinction between exclusive and inclusive scan now appears only at the end of the kernel, which avoids the odd "+2" in the shared memory size. Indices into the chunkResults array are now computed with Cuda::getInterleaving(), which avoids shared memory bank conflicts in the spine-scan phase.
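For readers unfamiliar with the upsweep-downsweep scheme, here is a minimal sequential sketch in host-side C++. It is not the kernel from this commit. The names blellochExclusiveScan, inclusiveFromExclusive and interleavedIndex are hypothetical, the input length is assumed to be a power of two, and the padding formula is a common bank-conflict remedy that Cuda::getInterleaving() may or may not match exactly.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Exclusive prefix sum using the upsweep/downsweep scheme.
// Assumption: data.size() is zero or a power of two.
std::vector<int> blellochExclusiveScan(std::vector<int> data)
{
    const std::size_t n = data.size();
    if (n == 0)
        return data;
    assert((n & (n - 1)) == 0 && "size must be a power of two");

    // Upsweep: build partial sums in place, as in a reduction tree.
    for (std::size_t stride = 1; stride < n; stride *= 2)
        for (std::size_t i = 2 * stride - 1; i < n; i += 2 * stride)
            data[i] += data[i - stride];

    // The root receives the identity element of the sum.
    data[n - 1] = 0;

    // Downsweep: push prefixes back down the tree.
    for (std::size_t stride = n / 2; stride >= 1; stride /= 2)
        for (std::size_t i = 2 * stride - 1; i < n; i += 2 * stride) {
            const int left = data[i - stride];
            data[i - stride] = data[i];
            data[i] += left;
        }
    return data;
}

// The inclusive result differs from the exclusive one only by adding the
// input element, so the distinction can be applied in one final step.
std::vector<int> inclusiveFromExclusive(const std::vector<int>& input,
                                        std::vector<int> exclusive)
{
    for (std::size_t i = 0; i < exclusive.size(); ++i)
        exclusive[i] += input[i];
    return exclusive;
}

// Hypothetical padding scheme: skip one slot after every `banks` elements so
// that strided accesses fall into different shared memory banks.
std::size_t interleavedIndex(std::size_t i, std::size_t banks = 32)
{
    return i + i / banks;
}
```

Calling blellochExclusiveScan on {3, 1, 7, 0, 4, 1, 6, 3} yields {0, 3, 4, 11, 11, 15, 16, 22}, and inclusiveFromExclusive turns that into {3, 4, 11, 11, 15, 16, 22, 25}. An array indexed through interleavedIndex needs correspondingly more storage than the unpadded one.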
parent
7a688833