Skip to content
Commit c37987b1 authored by Jakub Klinkovský's avatar Jakub Klinkovský
Browse files

Refactored CUDA parallel scan kernel

The input values are first copied into shared memory, reduced
sequentially across chunks, and scanned only at the end of the kernel.
This follows the upsweep-downsweep approach by Blelloch which is more
work-efficient. Also the distinction between exclusive and inclusive
scan appears only at the end of the kernel, which avoids the weird "+2"
size of the shared memory.

Also used Cuda::getInterleaving() for the indices when accessing the
chunkResults array, which avoids shared memory banks conflicts in the
spine-scan phase.
parent 7a688833
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment