- Sep 02, 2021
  - Jakub Klinkovský authored
  - Jakub Klinkovský authored
- Sep 01, 2021
  - Jakub Klinkovský authored
  - Jakub Klinkovský authored
- Aug 31, 2021
  - Jakub Klinkovský authored
- Aug 27, 2021
  - Jakub Klinkovský authored
    Amends e5fc6a96
  - Jakub Klinkovský authored
  - Jakub Klinkovský authored
- Aug 11, 2021
  - Jakub Klinkovský authored
    Scan refactoring. Closes #87. See merge request !100.
- Aug 08, 2021
  - Jakub Klinkovský authored
  - Jakub Klinkovský authored
  - Jakub Klinkovský authored
- Aug 06, 2021
  - Jakub Klinkovský authored
    - structs from HorizontalOperations.h reimplemented as function objects in Functional.h
    - repetitive function definitions generated using macros
    - added new operators: % (modulus) and ^ (xor)
  - Jakub Klinkovský authored
  - Jakub Klinkovský authored
    This way we test both the general CUDA implementation using shared memory and the specialization using __shfl instructions. Both the reduction and scan kernels needed some tweaks due to shared memory usage with non-fundamental types.
  - Jakub Klinkovský authored
    This is needed because custom specializations of std::is_arithmetic cannot be used (they cause undefined behaviour).
  - Jakub Klinkovský authored
  - Jakub Klinkovský authored
- Aug 03, 2021
  - Jakub Klinkovský authored
    The algorithms are implemented as plain functions in the TNL::Algorithms namespace. containsValue was replaced with contains.
- Jul 31, 2021
  - Jakub Klinkovský authored
  - Jakub Klinkovský authored
  - Jakub Klinkovský authored
  - Jakub Klinkovský authored
  - Jakub Klinkovský authored
  - Jakub Klinkovský authored
  - Jakub Klinkovský authored
    Removed reduction methods from Array and ArrayView; instead, added overloads of reduce and reduceWithArgument for arrays and views. Plain functions are much more flexible than methods, and the methods also violated the open-closed principle: https://en.wikipedia.org/wiki/Open%E2%80%93closed_principle
  - Jakub Klinkovský authored
  - Jakub Klinkovský authored
  - Jakub Klinkovský authored
    This adds back the original approach (prescan + uniform shift), which was removed too early.
  - Jakub Klinkovský authored
  - Jakub Klinkovský authored
  - Jakub Klinkovský authored
    The original approach (prescan + uniform shift) is more efficient for inputs that are expensive to evaluate, such as vector expressions.
  - Jakub Klinkovský authored
  - Jakub Klinkovský authored
  - Jakub Klinkovský authored
  - Jakub Klinkovský authored
    Using an odd number of valuesPerThread avoids shared memory bank conflicts even without special interleaving, and also saves some shared memory. Small inputs can be scanned with just one CUDA block, which avoids the scan of block results and the second-phase kernel. Hence, large arrays can be scanned with just 3 kernel launches instead of 4.
  - Jakub Klinkovský authored
    The input values are first copied into shared memory, reduced sequentially across chunks, and scanned only at the end of the kernel. This follows the upsweep-downsweep approach by Blelloch, which is more work-efficient. The distinction between exclusive and inclusive scan also appears only at the end of the kernel, which avoids the weird "+2" size of the shared memory. Cuda::getInterleaving() is now used for the indices when accessing the chunkResults array, which avoids shared memory bank conflicts in the spine-scan phase.
  - Jakub Klinkovský authored
    - input and output are passed by views rather than raw pointers (this allows scanning even vector expressions)
    - consequently, indexing is different (begin and end for the global memory accesses)
    - fixed calculation of currentSize in the launcher
    - changed configuration of the kernel using the blockSize and valuesPerThread template parameters rather than the elementsInBlock runtime parameter
    - changed allocation of the shared memory from dynamic to static
    - the second-phase kernel uses shared memory to cache block results for each block
  - Jakub Klinkovský authored
  - Jakub Klinkovský authored