Commit 429bd511 authored by Jakub Klinkovský

Refactored CUDA parallel scan kernel

Using an odd valuesPerThread avoids shared memory bank conflicts even
without a special interleaving, because an odd stride is coprime with
the (power-of-two) number of banks. We also save some shared memory
this way.
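
The bank-conflict argument can be checked on the host with a small sketch. It assumes 32 shared memory banks, 32 threads per warp, and an access pattern in which thread t reads element t * valuesPerThread + i at step i; these constants, the access pattern and the function name maxBankConflictDegree are illustrative assumptions, not taken from the kernel in this commit.

```cpp
#include <cassert>

// Assumed hardware parameters (typical for current NVIDIA GPUs).
const int kBanks = 32;     // shared memory banks, one 4-byte word each
const int kWarpSize = 32;  // threads that access shared memory together

// Hypothetical helper: returns the largest number of threads in one warp
// that hit the same bank when thread t accesses word t*valuesPerThread + i.
// A result of 1 means the access is conflict-free.
int maxBankConflictDegree(int valuesPerThread)
{
    assert(valuesPerThread > 0);
    int worst = 0;
    for (int i = 0; i < valuesPerThread; i++) {
        int hits[kBanks] = {};  // accesses per bank at step i
        for (int t = 0; t < kWarpSize; t++)
            hits[(t * valuesPerThread + i) % kBanks]++;
        for (int b = 0; b < kBanks; b++)
            if (hits[b] > worst)
                worst = hits[b];
    }
    return worst;
}
```

Under these assumptions, maxBankConflictDegree(7) returns 1, while maxBankConflictDegree(8) returns 8: an even stride shares a factor with 32, so several threads map to the same bank.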

Small inputs can be scanned with just one CUDA block, which avoids the
scan of block results and the second-phase kernel. Hence, large arrays
can be scanned with just 3 kernel launches instead of 4, because their
block results form such a small input.
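
The launch counts can be reproduced with a rough host-side model. It assumes a recursive scheme in which the first phase scans each block, the block results are scanned by a recursive call, and the second phase applies the offsets. The function countKernelLaunches, its parameters, and the assumption that the old code needed 2 launches for a single-block input are a reading of the message above, not the actual implementation.

```cpp
#include <cassert>

// Hypothetical model of the number of kernel launches needed to scan
// n elements when one CUDA block can process elementsPerBlock elements.
// singleBlockShortcut = true models the behaviour described in this
// commit (one launch for inputs that fit in a single block);
// false models the assumed earlier behaviour (first phase + second phase).
long countKernelLaunches(long n, long elementsPerBlock, bool singleBlockShortcut)
{
    assert(n > 0 && elementsPerBlock > 1);
    if (n <= elementsPerBlock)
        return singleBlockShortcut ? 1 : 2;
    // One partial result per block has to be scanned as well.
    const long blocks = (n + elementsPerBlock - 1) / elementsPerBlock;
    // first phase + scan of block results + second phase
    return 1 + countKernelLaunches(blocks, elementsPerBlock, singleBlockShortcut) + 1;
}
```

With an assumed capacity of 1792 elements per block (for example 256 threads times 7 values per thread), countKernelLaunches(1000000, 1792, true) returns 3 and countKernelLaunches(1000000, 1792, false) returns 4, matching the numbers in the message. In this model, inputs whose block results do not fit into one block would need more launches.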
parent c37987b1