- Aug 06, 2021
-
Jakub Klinkovský authored
This way we test both the general CUDA implementation using shared memory and the specialization using __shfl instructions. Both the reduction and scan kernels needed some tweaks due to shared memory usage with non-fundamental types.
-
This is needed because custom specializations of std::is_arithmetic cannot be used (they cause undefined behaviour).
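The standard forbids adding specializations of most standard type traits, so projects define their own trait instead. A minimal sketch of this pattern (the trait and type names here are illustrative, not necessarily TNL's):

```cpp
#include <type_traits>

// Hypothetical user-defined numeric type.
struct MyComplex { double re, im; };

// Specializing std::is_arithmetic for MyComplex would be undefined
// behaviour: the standard only permits user specializations where it
// explicitly says so, and std::is_arithmetic is not such a trait.
// Instead, define a project-local trait that defaults to the standard
// one and can be specialized freely:
template< typename T >
struct IsArithmetic : std::is_arithmetic< T > {};

template<>
struct IsArithmetic< MyComplex > : std::true_type {};

static_assert( IsArithmetic< double >::value, "double is arithmetic" );
static_assert( IsArithmetic< MyComplex >::value, "custom type opted in" );
static_assert( ! std::is_arithmetic< MyComplex >::value, "std trait unchanged" );
```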
-
- Aug 03, 2021
-
The algorithms are implemented as plain functions in TNL::Algorithms. containsValue was replaced with contains.
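The shape of such a free-function algorithm can be sketched in plain C++ (this is an illustration of the plain-function style, not TNL's actual implementation, which also dispatches to device backends):

```cpp
#include <algorithm>
#include <vector>

// Illustrative free function in the spirit of TNL::Algorithms::contains:
// works for any container exposing begin/end, rather than being a method
// tied to one class.
template< typename Container, typename Value >
bool contains( const Container& c, const Value& v )
{
   return std::find( std::begin( c ), std::end( c ), v ) != std::end( c );
}
```

Usage: `contains( v, 2 )` returns true iff the value occurs anywhere in `v`.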
-
- Jul 31, 2021
-
Removed reduction methods from Array and ArrayView; instead, added overloads of reduce and reduceWithArgument for arrays/views. Plain functions are much more flexible than methods. The methods were also violating the open-closed principle: https://en.wikipedia.org/wiki/Open%E2%80%93closed_principle
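A sequential host-side sketch of the two function shapes (TNL's real overloads also run on CUDA devices and take views; the signatures below are simplified assumptions for illustration):

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Fold the container with a binary operation, starting from an identity.
template< typename Container, typename Value, typename Reduction >
Value reduce( const Container& c, Value identity, Reduction&& op )
{
   Value result = identity;
   for( const auto& x : c )
      result = op( result, x );
   return result;
}

// Like reduce, but also track the index of the winning element.
// `better( x, best )` returns true when x should replace best.
template< typename Container, typename Value, typename Compare >
std::pair< Value, std::size_t >
reduceWithArgument( const Container& c, Value identity, Compare&& better )
{
   Value best = identity;
   std::size_t arg = 0;
   for( std::size_t i = 0; i < c.size(); i++ )
      if( better( c[ i ], best ) ) {
         best = c[ i ];
         arg = i;
      }
   return { best, arg };
}
```

Being free functions, new reductions can be added for new container types without touching Array or ArrayView, which is exactly the open-closed argument above.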
-
This adds back the original approach (prescan + uniform shift), which was removed too early.
-
The original approach (prescan + uniform shift) is more efficient for inputs that are expensive to evaluate, such as vector expressions.
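The prescan + uniform shift scheme can be illustrated sequentially (the "blocks" below simulate CUDA blocks; this is a sketch of the idea, not the kernel code):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Inclusive scan via prescan + uniform shift:
//  1. each block is scanned independently (the prescan),
//  2. the per-block totals are exclusively scanned,
//  3. each block's elements are shifted uniformly by its offset.
// The input is read only once per element, which is why this scheme wins
// when evaluating the input (e.g. a vector expression) is expensive.
std::vector< int > blockScan( const std::vector< int >& in, std::size_t blockSize )
{
   std::vector< int > out( in.size() );
   std::vector< int > blockSums;

   // phase 1: independent inclusive prescan of each block
   for( std::size_t b = 0; b * blockSize < in.size(); b++ ) {
      int sum = 0;
      const std::size_t end = std::min( in.size(), ( b + 1 ) * blockSize );
      for( std::size_t i = b * blockSize; i < end; i++ ) {
         sum += in[ i ];
         out[ i ] = sum;
      }
      blockSums.push_back( sum );
   }

   // phase 2: exclusive scan of the block totals
   int shift = 0;
   for( std::size_t b = 0; b < blockSums.size(); b++ ) {
      const int s = blockSums[ b ];
      blockSums[ b ] = shift;
      shift += s;
   }

   // phase 3: uniform shift of each block by its offset
   for( std::size_t i = 0; i < out.size(); i++ )
      out[ i ] += blockSums[ i / blockSize ];

   return out;
}
```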
-
Using an odd number of valuesPerThread avoids shared memory bank conflicts even without a special interleaving. We also save some shared memory this way. Small inputs can be scanned with just one CUDA block, which avoids the scan of block results and second-phase kernel. Hence, large arrays can be scanned with just 3 kernel launches instead of 4.
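Why an odd stride helps: shared memory typically has 32 banks, and thread t starts at offset t * valuesPerThread. An odd stride is coprime with 32, so the 32 threads of a warp map to 32 distinct banks; an even stride collapses them onto 32 / gcd(stride, 32) banks. A quick host-side check (assuming the usual 32 banks of 4-byte words):

```cpp
#include <set>

// Count the distinct shared-memory banks touched when thread t of a warp
// accesses element t * valuesPerThread (32 banks assumed).
int distinctBanks( int valuesPerThread )
{
   std::set< int > banks;
   for( int t = 0; t < 32; t++ )
      banks.insert( ( t * valuesPerThread ) % 32 );
   return static_cast< int >( banks.size() );
}
```

For example, `distinctBanks( 7 )` gives 32 (conflict-free), while `distinctBanks( 8 )` gives only 4, i.e. an 8-way bank conflict.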
-
The input values are first copied into shared memory, reduced sequentially across chunks, and scanned only at the end of the kernel. This follows the upsweep-downsweep approach by Blelloch, which is more work-efficient. Also the distinction between exclusive and inclusive scan appears only at the end of the kernel, which avoids the weird "+2" size of the shared memory. Also used Cuda::getInterleaving() for the indices when accessing the chunkResults array, which avoids shared memory bank conflicts in the spine-scan phase.
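Blelloch's upsweep-downsweep exclusive scan can be sketched sequentially (a simplified illustration assuming the size is a power of two, not TNL's kernel):

```cpp
#include <cstddef>
#include <vector>

// Work-efficient exclusive scan after Blelloch:
//  - upsweep: build partial sums up a binary tree in place,
//  - clear the root,
//  - downsweep: push prefixes back down the tree.
// Each phase does O(n) additions in total.
void blellochExclusiveScan( std::vector< int >& a )
{
   const std::size_t n = a.size();  // assumed to be a power of two

   // upsweep (reduce) phase
   for( std::size_t d = 1; d < n; d *= 2 )
      for( std::size_t i = 2 * d - 1; i < n; i += 2 * d )
         a[ i ] += a[ i - d ];

   // clear the root, then downsweep
   a[ n - 1 ] = 0;
   for( std::size_t d = n / 2; d >= 1; d /= 2 )
      for( std::size_t i = 2 * d - 1; i < n; i += 2 * d ) {
         const int t = a[ i - d ];
         a[ i - d ] = a[ i ];
         a[ i ] += t;
      }
}
```

Note how the inclusive/exclusive distinction appears only at the end: an inclusive result is the exclusive one combined with the original input, which is what lets the kernel defer it.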
-
- input and output are passed by views rather than raw pointers (this allows scanning even vector expressions)
- consequently, indexing is different (begin and end for the global memory accesses)
- fixed calculation of currentSize in the launcher
- changed configuration of the kernel using the blockSize and valuesPerThread template parameters rather than the elementsInBlock runtime parameter
- changed allocation of the shared memory from dynamic to static
- the second phase kernel uses shared memory to cache block results for each block
-
The latter is the standard name for it, and it is hidden from the generated documentation of the public interface.
-
The algorithms are supposed to be used via overloaded plain functions in the Algorithms namespace: for now, there are only inplaceInclusiveScan and inplaceExclusiveScan (and their distributed variants). The scan and segmentedScan methods were removed from data structures (Vector, VectorView, DistributedVector, DistributedVectorView). They were inflexible (only std::plus was actually used for reduction), incomplete (some overloads just threw NotImplementedError), and they were violating the open-closed principle: https://en.wikipedia.org/wiki/Open%E2%80%93closed_principle
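The semantics of the two functions can be sketched for a generic host container (TNL's versions dispatch to sequential, CUDA, or distributed backends; the simplified signatures below are assumptions for illustration):

```cpp
#include <cstddef>
#include <vector>

// Inclusive scan: element i becomes the sum of elements 0..i.
template< typename Container >
void inplaceInclusiveScan( Container& a )
{
   for( std::size_t i = 1; i < a.size(); i++ )
      a[ i ] += a[ i - 1 ];
}

// Exclusive scan: element i becomes the sum of elements 0..i-1
// (the first element becomes the identity, 0 for addition).
template< typename Container >
void inplaceExclusiveScan( Container& a )
{
   typename Container::value_type sum = 0;
   for( std::size_t i = 0; i < a.size(); i++ ) {
      const auto x = a[ i ];
      a[ i ] = sum;
      sum += x;
   }
}
```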
-
Also moved the test under Algorithms and made sure it is actually being compiled.
-
The first phase performs only per-block reduction, not scan. The output array elements are written only in the second phase, so overall we perform only `n` instead of `2n` write operations.
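The write-count saving can be illustrated sequentially (the "blocks" simulate CUDA blocks; a sketch of the scheme, not the kernel code):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Two-phase inclusive scan where phase 1 only reduces each block (no
// writes to the output). After scanning the block sums, phase 2 re-reads
// the input and writes each output element exactly once: n writes total
// instead of the 2n needed when phase 1 already writes a per-block scan.
std::vector< int > twoPhaseScan( const std::vector< int >& in, std::size_t blockSize )
{
   const std::size_t blocks = ( in.size() + blockSize - 1 ) / blockSize;
   std::vector< int > blockSums( blocks, 0 );

   // phase 1: per-block reduction only
   for( std::size_t i = 0; i < in.size(); i++ )
      blockSums[ i / blockSize ] += in[ i ];

   // exclusive scan of the block sums (the "spine")
   int shift = 0;
   for( std::size_t b = 0; b < blocks; b++ ) {
      const int s = blockSums[ b ];
      blockSums[ b ] = shift;
      shift += s;
   }

   // phase 2: scan within blocks, writing the output exactly once
   std::vector< int > out( in.size() );
   for( std::size_t b = 0; b < blocks; b++ ) {
      int sum = blockSums[ b ];
      const std::size_t end = std::min( in.size(), ( b + 1 ) * blockSize );
      for( std::size_t i = b * blockSize; i < end; i++ ) {
         sum += in[ i ];
         out[ i ] = sum;
      }
   }
   return out;
}
```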
-
Also fixed the idempotent values for Max and MaxWithArg (std::numeric_limits<T>::lowest() vs std::numeric_limits<T>::min()).
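The distinction matters for floating-point types: std::numeric_limits&lt;T&gt;::min() is the smallest *positive* normalized value, not the most negative one, so using it as the neutral element of a Max reduction silently breaks on all-negative inputs. A minimal sketch:

```cpp
#include <limits>

// Max reduction with the correct neutral element. For integers, min()
// and lowest() coincide, but for floating point min() is a tiny positive
// number while lowest() is the most negative finite value.
template< typename T >
T maxReduce( const T* data, int n )
{
   T result = std::numeric_limits< T >::lowest();  // not ::min()!
   for( int i = 0; i < n; i++ )
      if( data[ i ] > result )
         result = data[ i ];
   return result;
}
```

With `::min()` as the starting value, a reduction over `{-3.0, -1.5, -2.0}` would wrongly return the positive `min()` instead of -1.5.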
-