Refactored CUDA parallel scan kernel
Using an odd number of valuesPerThread avoids shared memory bank conflicts even without special interleaving, and it also saves some shared memory. Small inputs can be scanned with a single CUDA block, which avoids the scan of block results and the second-phase kernel. Hence, large arrays can be scanned with just 3 kernel launches instead of 4.
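The bank-conflict claim can be checked with a small host-side model. This is only a sketch under stated assumptions, not the kernel from this commit: it assumes 32 shared memory banks of one 4-byte word each, a warp of 32 threads, and that thread `t` reads element `t * valuesPerThread + k` of a shared array at step `k`. The function name `maxBankConflictDegree` and the constants are hypothetical and exist only for this illustration.

```cpp
#include <algorithm>
#include <array>

// Assumptions for this sketch (not taken from the commit):
// 32 banks, one 4-byte word per bank, 32 threads per warp.
constexpr int kWarpSize = 32;
constexpr int kNumBanks = 32;

// Returns the worst-case number of threads in one warp that hit the same
// bank when thread t accesses element t * valuesPerThread + k, taken over
// all steps k in [0, valuesPerThread). A result of 1 means conflict-free.
int maxBankConflictDegree(int valuesPerThread) {
    int worst = 0;
    for (int k = 0; k < valuesPerThread; ++k) {
        std::array<int, kNumBanks> hits{};  // per-bank access count, zeroed
        for (int t = 0; t < kWarpSize; ++t) {
            const int index = t * valuesPerThread + k;
            ++hits[index % kNumBanks];
        }
        worst = std::max(worst, *std::max_element(hits.begin(), hits.end()));
    }
    return worst;
}
```

In this model the conflict degree equals gcd(valuesPerThread, 32), so any odd value gives 1 (no conflicts), while 4 or 8 values per thread give 4-way or 8-way conflicts. Because an odd stride is already conflict-free, no padding or interleaved layout is needed, which is consistent with the shared memory saving mentioned above.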
parent
c37987b1