Refactored CUDA parallel scan kernel
- input and output are passed as views rather than raw pointers (this allows scanning even vector expressions); consequently, indexing is different (begin and end indices for the global memory accesses)
- fixed the calculation of currentSize in the launcher
- the kernel is now configured via the blockSize and valuesPerThread template parameters rather than the elementsInBlock runtime parameter
- changed the shared memory allocation from dynamic to static
- the second-phase kernel uses shared memory to cache the per-block results
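For reference, the two-phase scheme the kernels implement can be modeled on the CPU. The sketch below is a hypothetical illustration, not the kernel code itself: `blockSize` and `valuesPerThread` mirror the kernel's template parameters, so each "block" covers `blockSize * valuesPerThread` elements; phase 1 scans each block independently, and phase 2 adds the running total of the preceding block sums (which the second-phase kernel caches in shared memory).

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical CPU model of the two-phase block scan (inclusive, over int).
// blockSize and valuesPerThread mirror the kernel's template parameters.
template <std::size_t blockSize, std::size_t valuesPerThread>
std::vector<int> twoPhaseInclusiveScan(const std::vector<int>& input)
{
    constexpr std::size_t elementsPerBlock = blockSize * valuesPerThread;
    std::vector<int> output(input.size());
    std::vector<int> blockSums;

    // Phase 1: scan each block independently and record its total.
    for (std::size_t begin = 0; begin < input.size(); begin += elementsPerBlock) {
        const std::size_t end = std::min(begin + elementsPerBlock, input.size());
        int sum = 0;
        for (std::size_t i = begin; i < end; ++i) {
            sum += input[i];
            output[i] = sum;
        }
        blockSums.push_back(sum);
    }

    // Phase 2: shift each block by the sum of all preceding blocks
    // (the second-phase kernel caches these block results in shared memory).
    int offset = 0;
    for (std::size_t b = 0; b < blockSums.size(); ++b) {
        const std::size_t begin = b * elementsPerBlock;
        const std::size_t end = std::min(begin + elementsPerBlock, input.size());
        for (std::size_t i = begin; i < end; ++i)
            output[i] += offset;
        offset += blockSums[b];
    }
    return output;
}
```

On the GPU, phase 1 runs one thread block per chunk with the scan staged through static shared memory, and phase 2 applies the block offsets in a second kernel launch.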
parent 76a95d0b