Jul 31, 2021

Commits authored by Jakub Klinkovský:
- Removed the reduction methods from Array and ArrayView and added overloads of reduce and reduceWithArgument for arrays/views instead. Plain functions are much more flexible than methods, and the methods also violated the open-closed principle: https://en.wikipedia.org/wiki/Open%E2%80%93closed_principle
- Added back the original approach (prescan + uniform shift), which had been removed too early.
- The original approach (prescan + uniform shift) is more efficient for inputs that are expensive to evaluate, such as vector expressions.
- Using an odd number of valuesPerThread avoids shared-memory bank conflicts even without special interleaving, and it also saves some shared memory. Small inputs can be scanned with just one CUDA block, which avoids the scan of block results and the second-phase kernel; hence large arrays can be scanned with just 3 kernel launches instead of 4.
- The input values are first copied into shared memory, reduced sequentially across chunks, and scanned only at the end of the kernel. This follows the upsweep-downsweep approach by Blelloch, which is more work-efficient. The distinction between exclusive and inclusive scan now appears only at the end of the kernel, which avoids the weird "+2" size of the shared memory. Also used Cuda::getInterleaving() for the indices when accessing the chunkResults array, which avoids shared-memory bank conflicts in the spine-scan phase.
- Reworked the CUDA scan kernels:
  - input and output are passed as views rather than raw pointers (this allows scanning even vector expressions)
  - consequently, the indexing is different (begin and end for the global memory accesses)
  - fixed the calculation of currentSize in the launcher
  - the kernel is now configured via the blockSize and valuesPerThread template parameters rather than the elementsInBlock runtime parameter
  - changed the allocation of shared memory from dynamic to static
  - the second-phase kernel uses shared memory to cache block results for each block
- The latter is the standard name for it, and it is hidden from the generated documentation of the public interface.
- The algorithms are now meant to be used via overloaded plain functions in the Algorithms namespace: for now, there are only inplaceInclusiveScan and inplaceExclusiveScan (and their distributed variants). The scan and segmentedScan methods were removed from the data structures (Vector, VectorView, DistributedVector, DistributedVectorView). They were inflexible (only std::plus was actually used for reduction), incomplete (some overloads just threw NotImplementedError), and they violated the open-closed principle: https://en.wikipedia.org/wiki/Open%E2%80%93closed_principle
- Also moved the test under Algorithms and made sure it is actually being compiled.
- The first phase now performs only a per-block reduction, not a scan. The output array elements are written only in the second phase, so overall only `n` instead of `2n` write operations are performed.
- Also fixed the idempotent (neutral) values for Max and MaxWithArg: std::numeric_limits<T>::lowest() must be used instead of std::numeric_limits<T>::min().
- Hence, all of StaticArray, Array, ArrayView and even the expression templates are directly usable in reduction without the need to create a wrapping fetch functor. NDArray also has this interface in 1D.
- The tests should not rely on other parts of the library if possible.
- Unified the scan phases across device specializations:
  - the sequential scan does not need to be split, so "perform" performs the whole simple scan algorithm, "performFirstPhase" only reduces the block (i.e. the whole vector), and "performSecondPhase" performs the scan operation with the block result combined with a global offset as the initial value
  - the parallel OpenMP scan calls the sequential scan to process the block results
  - the parallel CUDA scan was changed so that the block results array holds an exclusive scan after the first phase, same as in the other device specializations