NDArray
Features
- write tests
- implement `SlicedNDArray`, generalize the slicing calculations from multimaps (see the layout sketch after this list)
- implement static NDArray (avoid dynamic allocations when all dimensions are static)
- implement low-dimensional accessors into the subspaces of the NDArray (see the accessor sketch after this list)
  - template arguments specify the dimensions contained in the subarray
  - method arguments specify the multiindex of the first subarray element (it cannot be in the interior of the array)
  - dimension contraction with ranges like `[begin, end)` is not supported
- access-intent attributes? (see Kokkos presentation) (useful e.g. for texture-cache loads of const data, but that's probably useless on Pascal and newer GPUs)
- move into TNL (C++14 is required for better constexpr functions and generic lambdas; nvcc bug: extended lambdas cannot be generic)
- distributed version with MPI
  - only dynamic sizes can be distributed
  - only 1D/2D/3D distribution (i.e. only up to 3 dimensions can be distributed)
  - support overlaps (ghost cells) and their synchronization (see the halo-exchange sketch after this list)
    - support periodic overlaps
    - support different overlap sizes in different directions
  - a distributed ndarray cannot return a view to the local data - `LocalArrayView` does not make sense due to overlaps
  - iterators: `forAll`, `forInternal`, `forBoundary`, `forLocalInternal`, `forLocalBoundary`, `forOverlaps` (either all or just some directions)
  - support asynchronous synchronization of overlaps (custom CUDA streams or per-thread default streams will be necessary)
- implement generic assignment operator (see the semantics sketch after this list)
  - support any value, device and index types
  - support any permutation
  - support copies to and from non-contiguous memory (e.g. subarrays)
- add support for different allocators (cf. `Array` implementation)
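
A minimal sketch of the sliced layout behind the `SlicedNDArray` item above, assuming slicing along the first axis of a 2D array with a compile-time slice size (the function name and the exact layout are illustrative, not a fixed interface):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Hypothetical sliced layout for a 2D array with M columns: rows are
// grouped into slices of SliceSize, and within each slice the storage is
// transposed, so A(i, j) and A(i+1, j) are adjacent in memory.
template< std::size_t SliceSize >
std::size_t slicedOffset( std::size_t i, std::size_t j, std::size_t M )
{
    const std::size_t slice = i / SliceSize;  // which slice row i falls into
    const std::size_t local = i % SliceSize;  // position of row i inside its slice
    return slice * SliceSize * M   // skip the preceding whole slices
         + j * SliceSize           // skip the preceding columns inside the slice
         + local;
}

int main()
{
    // vertically adjacent elements are contiguous within a slice...
    assert( slicedOffset< 4 >( 1, 2, 3 ) == slicedOffset< 4 >( 0, 2, 3 ) + 1 );
    // ...which is the locality effect mentioned under Applications below
    std::printf( "offset of A(5, 2) = %zu\n", slicedOffset< 4 >( 5, 2, 3 ) );
}
```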
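
A sketch of the low-dimensional accessor described above, reduced to raw pointers and strides: template arguments select the dimensions kept in the subarray, method arguments give the multiindex of its first element. `getSubarrayView` and `View2D` are hypothetical names, and only the 3D row-major case is shown:

```cpp
#include <array>
#include <cstddef>
#include <cstdio>
#include <vector>

// A 2D view into a 3D row-major array.
struct View2D
{
    double* data;
    std::size_t stride0, stride1;  // strides of the two kept dimensions
    double& operator()( std::size_t a, std::size_t b ) const
    { return data[ a * stride0 + b * stride1 ]; }
};

// Template parameters D0, D1 select the dimensions contained in the
// subarray; i0, i1, i2 is the multiindex of the first subarray element.
template< std::size_t D0, std::size_t D1 >
View2D getSubarrayView( double* data, const std::array< std::size_t, 3 >& sizes,
                        std::size_t i0, std::size_t i1, std::size_t i2 )
{
    const std::array< std::size_t, 3 > strides =
        { sizes[ 1 ] * sizes[ 2 ], sizes[ 2 ], 1 };  // row-major strides
    const std::size_t origin = i0 * strides[ 0 ] + i1 * strides[ 1 ] + i2 * strides[ 2 ];
    return { data + origin, strides[ D0 ], strides[ D1 ] };
}

int main()
{
    std::vector< double > a( 4 * 5 * 6 );
    // the (j, k) plane through i = 2:
    View2D plane = getSubarrayView< 1, 2 >( a.data(), { 4, 5, 6 }, 2, 0, 0 );
    plane( 3, 4 ) = 1.0;   // writes a[2][3][4]
    std::printf( "%g\n", a[ 2 * 30 + 3 * 6 + 4 ] );
}
```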
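
The overlap synchronization reduces to the standard halo exchange; here is a minimal 1D sketch with plain MPI (one overlap cell per side, non-periodic boundaries via `MPI_PROC_NULL`; a real implementation would extend this to 2D/3D distributions, periodic neighbors, and asynchronous transfers):

```cpp
#include <mpi.h>
#include <vector>

int main( int argc, char* argv[] )
{
    MPI_Init( &argc, &argv );
    int rank, nproc;
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &nproc );

    const int localSize = 8, overlap = 1;
    // layout: [left ghost | local data | right ghost]
    std::vector< double > a( overlap + localSize + overlap, rank );

    const int left  = ( rank > 0 ) ? rank - 1 : MPI_PROC_NULL;
    const int right = ( rank < nproc - 1 ) ? rank + 1 : MPI_PROC_NULL;

    // send the first local cells left, receive into the right ghost cells
    MPI_Sendrecv( &a[ overlap ], overlap, MPI_DOUBLE, left, 0,
                  &a[ overlap + localSize ], overlap, MPI_DOUBLE, right, 0,
                  MPI_COMM_WORLD, MPI_STATUS_IGNORE );
    // send the last local cells right, receive into the left ghost cells
    MPI_Sendrecv( &a[ localSize ], overlap, MPI_DOUBLE, right, 1,
                  &a[ 0 ], overlap, MPI_DOUBLE, left, 1,
                  MPI_COMM_WORLD, MPI_STATUS_IGNORE );

    MPI_Finalize();
}
```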
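
A semantics sketch for the generic assignment operator: elements are matched by multiindex rather than by buffer position, illustrated by copying a row-major 2D array into a column-major one (dispatch on device and index types is omitted):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

int main()
{
    const std::size_t N = 3, M = 4;
    std::vector< int > rowMajor( N * M ), colMajor( N * M );

    for( std::size_t i = 0; i < N; i++ )
        for( std::size_t j = 0; j < M; j++ )
            rowMajor[ i * M + j ] = static_cast< int >( 10 * i + j );

    // operator= between different permutations boils down to this loop:
    // the multiindex (i, j) is mapped through each array's own layout
    for( std::size_t i = 0; i < N; i++ )
        for( std::size_t j = 0; j < M; j++ )
            colMajor[ j * N + i ] = rowMajor[ i * M + j ];

    // same element at the same multiindex, different buffer positions
    for( std::size_t i = 0; i < N; i++ )
        for( std::size_t j = 0; j < M; j++ )
            assert( colMajor[ j * N + i ] == rowMajor[ i * M + j ] );
}
```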
Applications
- storage of NumDwarf coefficients in `tnl-mhfem` (that was the main driving force for the development)
- storage for `VectorField` in TNL
- storage of distribution functions and macroscopic quantities in LBM
- subarrays: writing 1D and 2D slices into VTK
- `SlicedNDArray` might improve data locality in CWYGMRES operations:
  - prefetching cannot cross page boundaries (see meltdown.pdf)
  - NUMA systems
- replace `StaticMatrix` in `tnl-mhfem` with an alias to a static 2-dimensional NDArray
Operations
- generic traversers for scalar functions (assign, add, subtract, multiply, ...); see the recursive sketch after this list
  - `forAll` method: calls a lambda function `f` as `f(i1, i2, ..., iN)`, where the indices are taken from the `[0, sizes)` multiindex range
  - `forInternal` method: calls a lambda function `f` as `f(i1, i2, ..., iN)`, where the indices are taken from the `[begin, end)` multiindex range (by default, `begin == 1` and `end == sizes - 1`)
  - `forBoundary` method: calls a lambda function `f` as `f(i1, i2, ..., iN)`, where the indices are taken from the `[0, sizes) - [begin, end)` multiindex set (by default, `begin == 1` and `end == sizes - 1`); the parallel version works only in 1D, 2D, and 3D
- `reduce_along_axis`, `reduce_along_axes` - generalized multireductions (sketched after this list); see also https://bitbucket.org/eigen/eigen/src/default/unsupported/Eigen/CXX11/src/Tensor/README.md?at=default&fileviewer=file-view-default#markdown-header-reduction-operations
- `apply_along_axis` - apply a function to all 1D slices along a given axis (challenge: parallelization of the outer or inner function); sketched after this list
  - Note that unlike numpy.apply_along_axis, the inner function cannot change the array dimension/shape.
  - Note that the similar NumPy function, apply_over_axes, is not applicable to NDArray because the slices along different axes have different types, so a single function cannot be applied to them. Also, even in NumPy it is interesting only with a change of dimension/shape.
- reordering of ND arrays along any axis (currently hardcoded in `tnl-mhfem` only for one specific layout of dofs)
- other NumPy array manipulation routines - logical re-shaping and transpose-like operations (i.e. return a view with different sizes or a permutation of axes without changing the data); see the stride-swapping sketch after this list
- Eigen geometrical operations on tensors
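
A host-only C++14 sketch of the `forAll` convention described above: the lambda is invoked as `f(i1, ..., iN)` over the `[0, sizes)` multiindex range via recursive expansion. The actual traversers would also dispatch on the device; `forInternal` and `forBoundary` differ only in the iterated range:

```cpp
#include <array>
#include <cstddef>
#include <cstdio>
#include <type_traits>

// Recursion terminator: all N indices have been generated, call f.
template< std::size_t N, typename Func, typename... Indices,
          typename std::enable_if< sizeof...(Indices) == N, bool >::type = true >
void forAll( const std::array< std::size_t, N >&, Func f, Indices... indices )
{
    f( indices... );
}

// Open a loop over the next axis and recurse.
template< std::size_t N, typename Func, typename... Indices,
          typename std::enable_if< (sizeof...(Indices) < N), bool >::type = true >
void forAll( const std::array< std::size_t, N >& sizes, Func f, Indices... indices )
{
    for( std::size_t i = 0; i < sizes[ sizeof...(Indices) ]; i++ )
        forAll( sizes, f, indices..., i );
}

int main()
{
    // iterate over the [0, 2) x [0, 3) multiindex range
    forAll< 2 >( { 2, 3 }, []( std::size_t i, std::size_t j )
                 { std::printf( "f(%zu, %zu)\n", i, j ); } );
}
```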
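
A semantics sketch for `reduce_along_axis` on a 2D row-major array: reducing along axis 0 with `+` keeps every axis except the reduced one, in the spirit of the Eigen tensor reductions linked above (the generalized version would take the operation and the axes as parameters):

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Reducing along axis 0 yields the column sums; reduce_along_axes
// would fold several axes at once.
std::vector< int > reduce_along_axis0( const std::vector< int >& a,
                                       std::size_t N, std::size_t M )
{
    std::vector< int > result( M, 0 );
    for( std::size_t i = 0; i < N; i++ )
        for( std::size_t j = 0; j < M; j++ )
            result[ j ] += a[ i * M + j ];
    return result;
}

int main()
{
    // the 2x3 array {{0,1,2},{3,4,5}} reduces to the column sums {3,5,7}
    for( int x : reduce_along_axis0( { 0, 1, 2, 3, 4, 5 }, 2, 3 ) )
        std::printf( "%d ", x );
    std::printf( "\n" );
}
```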
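
A sketch of `apply_along_axis` for the simplest case: a 2D row-major array with slices taken along the last axis, so every slice is contiguous. The in-place `f` illustrates the constraint that the dimension/shape cannot change; either the outer loop or `f` itself could be parallelized, which is the challenge noted above:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <vector>

// Apply f to every 1D slice along axis 1 (i.e. to every row).
template< typename Func >
void apply_along_axis1( std::vector< double >& a,
                        std::size_t N, std::size_t M, Func f )
{
    for( std::size_t i = 0; i < N; i++ )
        f( a.data() + i * M, M );   // contiguous slice of length M
}

int main()
{
    std::vector< double > a = { 3, 1, 2,  9, 7, 8 };  // 2x3 array
    // sort each row in place; the slice length stays fixed
    apply_along_axis1( a, 2, 3, []( double* row, std::size_t m )
                       { std::sort( row, row + m ); } );
    for( double x : a )
        std::printf( "%g ", x );   // prints: 1 2 3 7 8 9
    std::printf( "\n" );
}
```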
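
A sketch of the transpose-like view operations: only the view's strides are permuted, no data are moved, so writes through the transposed view land in the original buffer (a logical reshape would similarly reinterpret the sizes over the same contiguous buffer; the struct name is hypothetical):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A non-owning 2D view; transposing it just swaps the strides.
struct StridedView2D
{
    double* data;
    std::size_t stride0, stride1;
    double& operator()( std::size_t i, std::size_t j ) const
    { return data[ i * stride0 + j * stride1 ]; }
    StridedView2D transposed() const { return { data, stride1, stride0 }; }
};

int main()
{
    std::vector< double > buf( 3 * 4 );
    StridedView2D a{ buf.data(), 4, 1 };   // 3x4 row-major view
    StridedView2D at = a.transposed();     // 4x3 view over the same data
    at( 1, 2 ) = 42.0;
    assert( a( 2, 1 ) == 42.0 );           // no elements were copied
}
```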
Benchmarks
- compilation time depending on the number of dimensions, the number of ndarrays in the code, ...
- overhead of the indexing calculation for high-dimensional arrays (see the offset formula after this list)
- operations
- comparison with RAJA:
  - identity perm, set: 1D bandwidth: 9.4718 GB/s, 6D bandwidth: 8.52481 GB/s (9% difference)
  - identity perm, assign: 1D bandwidth: 11.503 GB/s, 6D bandwidth: 11.0063 GB/s (4.5% loss compared to 1D)
  - reverse perm, assign: 6D bandwidth: 9.58735 GB/s (13% loss compared to identity 6D)
- comparison with OpenFOAM - `ScalarField`, `VectorField`, `TensorField`, operations like tensor*vector (locally on mesh cells)
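
For reference, the indexing calculation whose overhead is benchmarked above, written for a row-major layout (a permuted layout only reorders the sizes and indices): the offset is a Horner-like scheme costing N-1 multiply-adds per element access:

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// offset = ((i1 * n2 + i2) * n3 + i3) * ... + iN
template< std::size_t N >
std::size_t rowMajorOffset( const std::array< std::size_t, N >& sizes,
                            const std::array< std::size_t, N >& indices )
{
    std::size_t offset = indices[ 0 ];
    for( std::size_t d = 1; d < N; d++ )
        offset = offset * sizes[ d ] + indices[ d ];
    return offset;
}

int main()
{
    // 4x5x6 array: offset of (1, 2, 3) is (1*5 + 2)*6 + 3 = 45
    assert( ( rowMajorOffset< 3 >( { 4, 5, 6 }, { 1, 2, 3 } ) == 45 ) );
}
```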
References
- AoS vs SoA vs SoAoS: some LBM paper
- chapters 21 and 31 in the GPU Computing Gems
- P0009: Polymorphic Multidimensional Array
- https://scicomp.stackexchange.com/questions/3130/how-to-efficiently-structure-simulation-data-in-memory-for-cells-with-varying-de