- Sep 02, 2021
  - Jakub Klinkovský authored
  - Jakub Klinkovský authored
- Sep 01, 2021
  - Jakub Klinkovský authored
  - Jakub Klinkovský authored
- Aug 31, 2021
  - Jakub Klinkovský authored
- Aug 27, 2021
  - Jakub Klinkovský authored
    Amends e5fc6a96
  - Jakub Klinkovský authored
  - Jakub Klinkovský authored
- Aug 11, 2021
  - Jakub Klinkovský authored
    Scan refactoring. Closes #87. See merge request !100.
- Aug 08, 2021
  - Jakub Klinkovský authored
  - Jakub Klinkovský authored
  - Jakub Klinkovský authored
- Aug 06, 2021
  - Jakub Klinkovský authored
    - structs from HorizontalOperations.h reimplemented as function objects in Functional.h
    - repetitive function definitions generated using macros
    - added new operators: % (modulus) and ^ (xor)
  - Jakub Klinkovský authored
  - Jakub Klinkovský authored
    This way we test both the general CUDA implementation using shared memory and the specialization using __shfl instructions. Both the reduction and scan kernels needed some tweaks due to shared memory usage with non-fundamental types.
  - Jakub Klinkovský authored
    This is needed because custom specializations of std::is_arithmetic cannot be used (they cause undefined behaviour).
  - Jakub Klinkovský authored
  - Jakub Klinkovský authored
- Aug 03, 2021
  - Jakub Klinkovský authored
    The algorithms are implemented as plain functions in the TNL::Algorithms namespace. containsValue was replaced with contains.
- Jul 31, 2021
  - Jakub Klinkovský authored
  - Jakub Klinkovský authored
  - Jakub Klinkovský authored
  - Jakub Klinkovský authored
  - Jakub Klinkovský authored
  - Jakub Klinkovský authored
  - Jakub Klinkovský authored
    Removed reduction methods from Array and ArrayView; instead, added overloads of reduce and reduceWithArgument for arrays and views. Plain functions are much more flexible than methods, and the methods also violated the open-closed principle: https://en.wikipedia.org/wiki/Open%E2%80%93closed_principle
  - Jakub Klinkovský authored
  - Jakub Klinkovský authored
  - Jakub Klinkovský authored
    This adds back the original approach (prescan + uniform shift), which was removed too early.
  - Jakub Klinkovský authored
  - Jakub Klinkovský authored
  - Jakub Klinkovský authored
    The original approach (prescan + uniform shift) is more efficient for inputs that are expensive to evaluate, such as vector expressions.
  - Jakub Klinkovský authored
  - Jakub Klinkovský authored
  - Jakub Klinkovský authored
  - Jakub Klinkovský authored
    Using an odd number of valuesPerThread avoids shared memory bank conflicts even without special interleaving, and also saves some shared memory. Small inputs can be scanned with just one CUDA block, which avoids the scan of block results and the second-phase kernel. Hence, large arrays can be scanned with just 3 kernel launches instead of 4.
  - Jakub Klinkovský authored
    The input values are first copied into shared memory, reduced sequentially across chunks, and scanned only at the end of the kernel. This follows the upsweep-downsweep approach by Blelloch, which is more work-efficient. The distinction between exclusive and inclusive scan also appears only at the end of the kernel, which avoids the weird "+2" size of the shared memory. Cuda::getInterleaving() is now used for the indices when accessing the chunkResults array, which avoids shared memory bank conflicts in the spine-scan phase.
  - Jakub Klinkovský authored
    - input and output are passed by views rather than raw pointers (this allows scanning even vector expressions)
    - consequently, indexing is different (begin and end for the global memory accesses)
    - fixed calculation of currentSize in the launcher
    - changed configuration of the kernel using the blockSize and valuesPerThread template parameters rather than the elementsInBlock runtime parameter
    - changed allocation of the shared memory from dynamic to static
    - the second-phase kernel uses shared memory to cache block results for each block
  - Jakub Klinkovský authored
  - Jakub Klinkovský authored