# NDArray

## Features

- write tests
- implement `SlicedNDArray`, generalize the slicing calculations from multimaps
- implement static `NDArray` (avoid dynamic allocations when all dimensions are static)
- implement low-dimensional accessors into the subspaces of the `NDArray`:
    - template arguments specify the dimensions contained in the subarray
    - method arguments specify the multi-index of the first subarray element (it cannot be in the interior of the array)
    - dimension contraction with ranges like `[begin, end)` is not supported
- access-intent attributes? (see the Kokkos presentation) (useful e.g. for texture-cache loads of `const` data, but that is probably useless on Pascal and newer GPUs)

- move into TNL (C++14 is required for better `constexpr` functions and generic lambdas; nvcc bug: extended lambdas cannot be generic)
- distributed version with MPI:
    - only dynamic sizes can be distributed
    - only 1D/2D/3D distribution (i.e. only up to 3 dimensions can be distributed)
    - support overlaps (ghost cells) and their synchronization
    - support periodic overlaps
    - support different overlap sizes in different directions
    - a distributed ndarray cannot return a view to the local data: a `LocalArrayView` does not make sense due to the overlaps
    - iterators: `forAll`, `forInternal`, `forBoundary`, `forLocalInternal`, `forLocalBoundary`, `forOverlaps` (either all or just some directions)
    - support asynchronous synchronization of the overlaps (custom CUDA streams or per-thread default streams will be necessary)
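To make the overlap terminology concrete, here is a minimal serial sketch of what synchronizing periodic overlaps means for a 1D decomposition. It is not the TNL implementation: both "ranks" live in one process, and the plain copies stand in for the MPI send/receive pairs; the name `synchronizeOverlaps` is illustrative.

```cpp
#include <cstddef>
#include <vector>

// Each local block stores `overlap` ghost cells on each side:
// layout: [ghost | interior | ghost]
using Block = std::vector<double>;

// Fill the ghost cells of every block from its periodic neighbors.
void synchronizeOverlaps(std::vector<Block>& blocks, std::size_t overlap) {
    const std::size_t nblocks = blocks.size();
    for (std::size_t b = 0; b < nblocks; b++) {
        Block& cur = blocks[b];
        const Block& left  = blocks[(b + nblocks - 1) % nblocks];
        const Block& right = blocks[(b + 1) % nblocks];
        const std::size_t n = cur.size() - 2 * overlap;  // interior size
        for (std::size_t i = 0; i < overlap; i++) {
            // left ghosts <- last interior cells of the left neighbor
            cur[i] = left[left.size() - 2 * overlap + i];
            // right ghosts <- first interior cells of the right neighbor
            cur[overlap + n + i] = right[overlap + i];
        }
    }
}
```

Only interior cells are ever read, so the result does not depend on the order in which the blocks are processed; the same property is what allows the asynchronous (stream-based) synchronization mentioned above.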

- implement a generic assignment operator:
    - support any value, device, and index types
    - support any permutation
    - support copies to and from non-contiguous memory (e.g. subarrays)
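The permutation-related part of the generic assignment can be sketched as follows (a simplified 2D, host-only illustration, not the actual TNL implementation): both sides are addressed through their own strides, so the loop copies element by element regardless of the physical layouts.

```cpp
#include <cstddef>
#include <vector>

// A minimal strided 2D view; the strides encode the layout permutation
// (row-major, column-major, or a non-contiguous subarray).
struct View2D {
    std::vector<double>& data;
    std::size_t stride0, stride1;
    double& operator()(std::size_t i, std::size_t j) {
        return data[i * stride0 + j * stride1];
    }
};

// Assign src to dst over the common logical sizes (n0, n1).
void assign(View2D dst, View2D src, std::size_t n0, std::size_t n1) {
    for (std::size_t i = 0; i < n0; i++)
        for (std::size_t j = 0; j < n1; j++)
            dst(i, j) = src(i, j);
}
```

Copying a row-major 2x3 array (strides 3, 1) into a column-major one (strides 1, 2) physically reorders the data while preserving the logical element `(i, j)`.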


- add support for different allocators (cf. the `Array` implementation)
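Allocator support presumably means making the allocator a template parameter defaulting to `std::allocator`, as in the TNL `Array`. A minimal sketch under that assumption (the class name and interface here are illustrative only):

```cpp
#include <cstddef>
#include <memory>

// Illustrative array with a pluggable allocator, accessed through
// std::allocator_traits so that minimal custom allocators also work.
template <typename Value, typename Allocator = std::allocator<Value>>
class SimpleArray {
public:
    explicit SimpleArray(std::size_t size, const Allocator& alloc = Allocator())
    : alloc_(alloc), size_(size)
    {
        data_ = std::allocator_traits<Allocator>::allocate(alloc_, size_);
        for (std::size_t i = 0; i < size_; i++)
            std::allocator_traits<Allocator>::construct(alloc_, data_ + i);
    }

    ~SimpleArray() {
        for (std::size_t i = 0; i < size_; i++)
            std::allocator_traits<Allocator>::destroy(alloc_, data_ + i);
        std::allocator_traits<Allocator>::deallocate(alloc_, data_, size_);
    }

    // copies omitted in this sketch
    SimpleArray(const SimpleArray&) = delete;
    SimpleArray& operator=(const SimpleArray&) = delete;

    std::size_t size() const { return size_; }
    Value& operator[](std::size_t i) { return data_[i]; }

private:
    Allocator alloc_;
    std::size_t size_ = 0;
    Value* data_ = nullptr;
};
```

A CUDA allocator would then only have to provide `allocate`/`deallocate` for device memory.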
## Applications

- storage of the NumDwarf coefficients in `tnl-mhfem` (that was the main driving force for the development)
- storage for `VectorField` in TNL
- storage of distribution functions and macroscopic quantities in LBM
- subarrays: writing 1D and 2D slices into VTK
- `SlicedNDArray` might improve data locality in CWYGMRES operations:
    - prefetching cannot cross page boundaries (see meltdown.pdf)
    - NUMA systems
- replace `StaticMatrix` in `tnl-mhfem` with an alias to a static 2-dimensional `NDArray`
## Operations

- generic traversers for scalar functions (assign, add, subtract, multiply, ...):
    - `forAll` method: calls a lambda function `f` as `f(i1, i2, ..., iN)`, where the indices are taken from the `[0, sizes)` multi-index range
    - `forInternal` method: calls a lambda function `f` as `f(i1, i2, ..., iN)`, where the indices are taken from the `[begin, end)` multi-index range (by default, `begin == 1` and `end == sizes - 1`)
    - `forBoundary` method: calls a lambda function `f` as `f(i1, i2, ..., iN)`, where the indices are taken from the `[0, sizes) ∖ [begin, end)` multi-index set (by default, `begin == 1` and `end == sizes - 1`); the parallel version works only in 1D, 2D, and 3D
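The traverser semantics can be pinned down with a plain serial 2D sketch (the actual methods are N-dimensional and parallel; these free functions are only an illustration of which multi-indices each traverser visits, with the default `begin == 1` and `end == sizes - 1`):

```cpp
#include <cstddef>

// Visit the full [0, sizes) multi-index range.
template <typename Func>
void forAll(std::size_t n0, std::size_t n1, Func f) {
    for (std::size_t i = 0; i < n0; i++)
        for (std::size_t j = 0; j < n1; j++)
            f(i, j);
}

// Visit the default internal range [1, sizes - 1).
template <typename Func>
void forInternal(std::size_t n0, std::size_t n1, Func f) {
    for (std::size_t i = 1; i + 1 < n0; i++)
        for (std::size_t j = 1; j + 1 < n1; j++)
            f(i, j);
}

// Visit the complement: [0, sizes) without the internal range.
template <typename Func>
void forBoundary(std::size_t n0, std::size_t n1, Func f) {
    forAll(n0, n1, [&](std::size_t i, std::size_t j) {
        const bool internal = i >= 1 && i + 1 < n0 && j >= 1 && j + 1 < n1;
        if (!internal)
            f(i, j);
    });
}
```

For a 4x5 array, `forAll` visits 20 indices, `forInternal` visits the 2x3 interior, and `forBoundary` visits the remaining 14 boundary indices.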


- `reduce_along_axis`, `reduce_along_axes` – generalized multireductions; see also https://bitbucket.org/eigen/eigen/src/default/unsupported/Eigen/CXX11/src/Tensor/README.md?at=default&fileviewer=fileviewdefault#markdownheaderreductionoperations
- `apply_along_axis` – apply a function to all 1D slices along a given axis (challenge: parallelization of the outer or inner function); note that unlike `numpy.apply_along_axis`, the inner function cannot change the array dimension/shape
    - the similar NumPy function, `numpy.apply_over_axes`, is not applicable to `NDArray`, because the slices along different axes have different types, so a single function cannot be applied to them; also, even in NumPy it is interesting only with a change of dimension/shape
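Both operations can be sketched on a row-major 2D array (hypothetical serial illustrations, not the actual TNL interface; the real versions would be N-dimensional, parallel, and generic in the reduction operation):

```cpp
#include <cstddef>
#include <vector>

// reduce_along_axis sketch: reducing (here: summing) along an axis removes
// that axis from the shape, as in NumPy's sum(axis=...).
std::vector<double> sumAlongAxis(const std::vector<double>& a,
                                 std::size_t n0, std::size_t n1, int axis)
{
    std::vector<double> result(axis == 0 ? n1 : n0, 0.0);
    for (std::size_t i = 0; i < n0; i++)
        for (std::size_t j = 0; j < n1; j++)
            result[axis == 0 ? j : i] += a[i * n1 + j];
    return result;
}

// A strided view of one 1D slice of the array.
struct Slice {
    double* data;
    std::size_t size, stride;
    double& operator[](std::size_t k) const { return data[k * stride]; }
};

// apply_along_axis sketch: f is called on every 1D slice along the given
// axis; it may modify the elements in place, but it cannot change the slice
// length (i.e. the array shape).
template <typename Func>
void applyAlongAxis(std::vector<double>& a,
                    std::size_t n0, std::size_t n1, int axis, Func f)
{
    if (axis == 0)   // the slices are columns: length n0, stride n1
        for (std::size_t j = 0; j < n1; j++)
            f(Slice{a.data() + j, n0, n1});
    else             // the slices are rows: length n1, stride 1
        for (std::size_t i = 0; i < n0; i++)
            f(Slice{a.data() + i * n1, n1, 1});
}
```

The `Slice` view also shows why the slices along different axes have different types in a statically typed setting: the stride pattern is part of the slice's description.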

- reordering of ND arrays along any axis (currently hard-coded in `tnl-mhfem` only for one specific layout of dofs)
- other NumPy array manipulation routines – logical reshaping and transpose-like operations (i.e. returning a view with different sizes or a permutation of the axes, without changing the data)
- Eigen geometrical operations on tensors
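The "view without changing the data" idea behind the transpose-like operations can be shown in a few lines (an illustrative 2D sketch with made-up names, not the TNL interface): the axes are swapped only in the index computation, the underlying data is never moved.

```cpp
#include <cstddef>
#include <vector>

// Logical transpose of a row-major n0 x n1 array, as a non-owning view.
struct Transposed2D {
    const std::vector<double>& data;
    std::size_t n0, n1;   // sizes of the ORIGINAL row-major array

    // element (i, j) of the n1 x n0 view is element (j, i) of the original
    double operator()(std::size_t i, std::size_t j) const {
        return data[j * n1 + i];
    }
};
```

A logical reshape works the same way: only the size/stride metadata changes, which is why these routines can be O(1) regardless of the array size.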
## Benchmarks

- compilation time depending on the number of dimensions, the number of ndarrays in the code, ...
- overhead of the indexing calculation for a high-dimensional array
- operations
- comparison with RAJA:
    - identity perm, set: 1D bandwidth 9.4718 GB/s, 6D bandwidth 8.52481 GB/s (9% difference)
    - identity perm, assign: 1D bandwidth 11.503 GB/s, 6D bandwidth 11.0063 GB/s (4.5% loss compared to 1D)
    - reverse perm, assign: 6D bandwidth 9.58735 GB/s (13% loss compared to identity 6D)
- comparison with OpenFOAM – `ScalarField`, `VectorField`, `TensorField`, operations like tensor*vector (locally on mesh cells)
## References

- AoS vs SoA vs SoAoS: some LBM paper
- chapters 21 and 31 in the GPU Computing Gems
- P0009: Polymorphic Multidimensional Array
- https://scicomp.stackexchange.com/questions/3130/howtoefficientlystructuresimulationdatainmemoryforcellswithvaryingde
Edited by Jakub Klinkovský