Refactored CUDA parallel scan kernel
- input and output are passed as views rather than raw pointers (this allows scanning even vector expressions); consequently, indexing is different (begin and end indices for the global memory accesses)
- fixed the calculation of currentSize in the launcher
- the kernel is now configured via the blockSize and valuesPerThread template parameters rather than the elementsInBlock runtime parameter
- changed the shared memory allocation from dynamic to static
- the second-phase kernel uses shared memory to cache the per-block results
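For reference, the two-phase scheme the kernels implement can be modeled on the CPU. The sketch below is a hypothetical illustration, not the kernel code itself: `blockSize` and `valuesPerThread` mirror the kernel's template parameters, so each "block" covers `blockSize * valuesPerThread` elements; phase 1 scans each block independently, and phase 2 adds the running total of the preceding block sums (which the second-phase kernel caches in shared memory).

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical CPU model of the two-phase block scan (inclusive, over int).
// blockSize and valuesPerThread mirror the kernel's template parameters.
template <std::size_t blockSize, std::size_t valuesPerThread>
std::vector<int> twoPhaseInclusiveScan(const std::vector<int>& input)
{
    constexpr std::size_t elementsPerBlock = blockSize * valuesPerThread;
    std::vector<int> output(input.size());
    std::vector<int> blockSums;

    // Phase 1: scan each block independently and record its total.
    for (std::size_t begin = 0; begin < input.size(); begin += elementsPerBlock) {
        const std::size_t end = std::min(begin + elementsPerBlock, input.size());
        int sum = 0;
        for (std::size_t i = begin; i < end; ++i) {
            sum += input[i];
            output[i] = sum;
        }
        blockSums.push_back(sum);
    }

    // Phase 2: shift each block by the sum of all preceding blocks
    // (the second-phase kernel caches these block results in shared memory).
    int offset = 0;
    for (std::size_t b = 0; b < blockSums.size(); ++b) {
        const std::size_t begin = b * elementsPerBlock;
        const std::size_t end = std::min(begin + elementsPerBlock, input.size());
        for (std::size_t i = begin; i < end; ++i)
            output[i] += offset;
        offset += blockSums[b];
    }
    return output;
}
```

On the GPU, phase 1 runs one thread block per chunk with the scan staged through static shared memory, and phase 2 applies the block offsets in a second kernel launch.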
parent 76a95d0b