Jul 31, 2021

Commits authored by Jakub Klinkovský:
- Removed the reduction methods from Array and ArrayView and added overloads of reduce and reduceWithArgument for arrays/views instead. Plain functions are much more flexible than methods, and the methods also violated the open-closed principle: https://en.wikipedia.org/wiki/Open%E2%80%93closed_principle
- Added back the original approach (prescan + uniform shift), which had been removed too early.
- The original approach (prescan + uniform shift) is more efficient for inputs that are expensive to evaluate, such as vector expressions.
- Using an odd number of valuesPerThread avoids shared-memory bank conflicts even without special interleaving, and it also saves some shared memory. Small inputs can be scanned with just one CUDA block, which avoids the scan of block results and the second-phase kernel; hence large arrays can be scanned with just 3 kernel launches instead of 4.
- The input values are first copied into shared memory, reduced sequentially across chunks, and scanned only at the end of the kernel. This follows the upsweep-downsweep approach by Blelloch, which is more work-efficient. The distinction between exclusive and inclusive scan now appears only at the end of the kernel, which avoids the weird "+2" size of the shared memory. Also used Cuda::getInterleaving() for the indices when accessing the chunkResults array, which avoids shared-memory bank conflicts in the spine-scan phase.
- Reworked the CUDA scan kernels:
  - input and output are passed as views rather than raw pointers (this allows scanning even vector expressions)
  - consequently, the indexing is different (begin and end for the global memory accesses)
  - fixed the calculation of currentSize in the launcher
  - the kernel is now configured via the blockSize and valuesPerThread template parameters rather than the elementsInBlock runtime parameter
  - changed the allocation of shared memory from dynamic to static
  - the second-phase kernel uses shared memory to cache block results for each block
- The latter is the standard name for it, and it is hidden from the generated documentation of the public interface.
- The algorithms are now meant to be used via overloaded plain functions in the Algorithms namespace: for now, there are only inplaceInclusiveScan and inplaceExclusiveScan (and their distributed variants). The scan and segmentedScan methods were removed from the data structures (Vector, VectorView, DistributedVector, DistributedVectorView). They were inflexible (only std::plus was actually used for reduction), incomplete (some overloads just threw NotImplementedError), and they violated the open-closed principle: https://en.wikipedia.org/wiki/Open%E2%80%93closed_principle
- Also moved the test under Algorithms and made sure it is actually being compiled.
- The first phase now performs only a per-block reduction, not a scan. The output array elements are written only in the second phase, so overall only `n` instead of `2n` write operations are performed.
- Also fixed the idempotent (neutral) values for Max and MaxWithArg: std::numeric_limits<T>::lowest() must be used instead of std::numeric_limits<T>::min().
- Hence, all of StaticArray, Array, ArrayView and even the expression templates are directly usable in reduction without the need to create a wrapping fetch functor. NDArray also has this interface in 1D.
- The tests should not rely on other parts of the library if possible.
- Unified the scan phases across device specializations:
  - the sequential scan does not need to be split, so "perform" performs the whole simple scan algorithm, "performFirstPhase" only reduces the block (i.e. the whole vector), and "performSecondPhase" performs the scan operation with the block result combined with a global offset as the initial value
  - the parallel OpenMP scan calls the sequential scan to process the block results
  - the parallel CUDA scan was changed so that the block results array holds an exclusive scan after the first phase, same as in the other device specializations