# NDArray

## Features

- write tests
- implement `SlicedNDArray`, generalize the slicing calculations from multimaps
- implement static `NDArray` (avoid dynamic allocations when all dimensions are static)

- implement low-dimensional accessors into the subspaces of the `NDArray`
    - template arguments specify the dimensions contained in the subarray
    - method arguments specify the multiindex of the first subarray element (it cannot be in the interior of the array)
    - dimension contraction with ranges like `[begin, end)` is not supported
- access-intent attributes? (see Kokkos presentation) (useful e.g. for texture-cache loads of const data, but that is probably useless on Pascal and newer GPUs)
- move into TNL (C++14 is required for better constexpr functions and generic lambdas; nvcc bug: extended lambdas cannot be generic)
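The subarray accessors above come down to stride reuse: a subarray addresses the parent's data with a subset of the parent's strides plus a fixed offset, and the "not in the interior" restriction means the first subarray element lies on a boundary face. A minimal sketch of this index arithmetic for a row-major 3D array, with one axis fixed; the function names (`rowMajorStrides`, `sliceOffset`) are illustrative, not the TNL API:

```cpp
#include <array>
#include <cstddef>

// Row-major strides of a 3D array with sizes (n0, n1, n2):
// stride of axis i = product of the sizes of the faster axes.
inline std::array<std::size_t, 3>
rowMajorStrides(std::size_t n0, std::size_t n1, std::size_t n2)
{
    (void) n0;  // the slowest axis does not contribute to any stride
    return { n1 * n2, n2, 1 };
}

// Offset (in elements) of the 2D subarray { i0 = fixed, all i1, i2 }.
// Its first element has i1 = i2 = 0, i.e. it sits on the boundary of
// the full array, matching the "not in the interior" restriction.
inline std::size_t
sliceOffset(std::size_t fixed_i0, std::size_t n1, std::size_t n2)
{
    return fixed_i0 * n1 * n2;
}
```

The 2D subarray then indexes as `data[sliceOffset(i0, n1, n2) + i1 * strides[1] + i2 * strides[2]]`, without copying anything.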

- distributed version with MPI
    - only dynamic sizes can be distributed
    - only 1D/2D/3D distribution (i.e. only up to 3 dimensions can be distributed)
    - support overlaps (ghost cells) and their synchronization
    - support periodic overlaps
    - support different overlap sizes in different directions
    - a distributed ndarray cannot return a view to the local data (a `LocalArrayView` does not make sense due to the overlaps)
    - iterators: `forAll`, `forInternal`, `forBoundary`, `forLocalInternal`, `forLocalBoundary`, `forOverlaps` (either all or just some directions)
    - support asynchronous synchronization of overlaps (custom CUDA streams or per-thread default streams will be necessary)
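To make the overlap handling concrete, here is a single-process sketch of what periodic overlap synchronization does in 1D: the ghost cells on each side are filled from the opposite end of the interior. In the distributed version each copy would become an MPI exchange with a neighboring rank; the function name and layout are assumptions for illustration only.

```cpp
#include <cstddef>
#include <vector>

// Layout: [ left ghosts | local interior | right ghosts ],
// with 'overlap' ghost cells on each side.
inline void syncPeriodicOverlaps(std::vector<double>& a, std::size_t overlap)
{
    const std::size_t n = a.size() - 2 * overlap;  // interior size
    for (std::size_t i = 0; i < overlap; i++) {
        // left ghosts take the values of the rightmost interior cells
        a[i] = a[n + i];
        // right ghosts take the values of the leftmost interior cells
        a[overlap + n + i] = a[overlap + i];
    }
}
```

With more than one rank, each of the two copies turns into a send/receive pair, which is exactly where asynchronous synchronization (overlapping the exchange with interior computation) pays off.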

- implement generic assignment operator
    - support any value, device and index types
    - support any permutation
    - support copies to and from non-contiguous memory (e.g. subarrays)
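The key idea behind a permutation-agnostic assignment is that both sides are addressed through the logical multiindex, and each side maps it to its own storage offset, so the layouts need not match. A 2D sketch copying from a row-major to a column-major buffer; the function names are illustrative, and TNL's `operator=` would generalize this to N dimensions, arbitrary permutations, and cross-device copies:

```cpp
#include <cstddef>
#include <vector>

// Storage offset of logical element (i, j) in each layout.
inline std::size_t rowMajor(std::size_t i, std::size_t j,
                            std::size_t /*rows*/, std::size_t cols)
{ return i * cols + j; }

inline std::size_t colMajor(std::size_t i, std::size_t j,
                            std::size_t rows, std::size_t /*cols*/)
{ return j * rows + i; }

// Element-wise assignment driven by the logical multiindex (i, j);
// neither side needs to be contiguous in the other's layout.
inline void assignRowToColMajor(const std::vector<int>& src,
                                std::vector<int>& dst,
                                std::size_t rows, std::size_t cols)
{
    for (std::size_t i = 0; i < rows; i++)
        for (std::size_t j = 0; j < cols; j++)
            dst[colMajor(i, j, rows, cols)] = src[rowMajor(i, j, rows, cols)];
}
```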

- add support for different allocators (c.f. the `Array` implementation)
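For reference, the usual shape of allocator support follows the `std::vector` pattern: the allocator is a template parameter with a default, so host, CUDA, or pinned-memory allocators can be swapped in without changing the container. A minimal stand-in sketch (`MiniArray` is illustrative, not the `NDArray` interface):

```cpp
#include <cstddef>
#include <memory>

template <typename Value, typename Allocator = std::allocator<Value>>
struct MiniArray
{
    explicit MiniArray(std::size_t size)
        : n(size), data(alloc.allocate(size)) {}
    ~MiniArray() { alloc.deallocate(data, n); }

    // non-copyable for brevity; a real container would deep-copy
    MiniArray(const MiniArray&) = delete;
    MiniArray& operator=(const MiniArray&) = delete;

    Allocator alloc;   // declared first so it is constructed first
    std::size_t n;
    Value* data;
};
```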
## Applications

- storage of NumDwarf coefficients in `tnl-mhfem` (that was the main driving force for the development)
- storage for `VectorField` in TNL
- storage of distribution functions and macroscopic quantities in LBM
- subarrays: writing 1D and 2D slices into VTK

- `SlicedNDArray` might improve data locality in CWYGMRES operations: prefetching cannot cross page boundaries (see meltdown.pdf)
    - NUMA systems

- replace `StaticMatrix` in `tnl-mhfem` with an alias to a static 2-dimensional `NDArray`
## Operations

- generic traversers for scalar functions (assign, add, subtract, multiply, ...)
    - `forAll` method: calls a lambda function `f` as `f(i1, i2, ..., iN)`, where the indices are taken from the `[0, sizes)` multiindex range
    - `forInternal` method: calls a lambda function `f` as `f(i1, i2, ..., iN)`, where the indices are taken from the `[begin, end)` multiindex range (by default, `begin == 1` and `end == sizes - 1`)
    - `forBoundary` method: calls a lambda function `f` as `f(i1, i2, ..., iN)`, where the indices are taken from the `[0, sizes) ∖ [begin, end)` multiindex set (by default, `begin == 1` and `end == sizes - 1`); the parallel version works only in 1D, 2D, and 3D
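A sequential 2D sketch of these traverser semantics, showing which multiindices each one visits. The real methods are variadic in the number of dimensions and dispatch to host or CUDA kernels; the names match the list above, but the simplified free-function signatures are assumptions:

```cpp
#include <cstddef>

template <typename F>
void forAll(std::size_t n0, std::size_t n1, F f)
{
    // the whole [0, sizes) multiindex range
    for (std::size_t i = 0; i < n0; i++)
        for (std::size_t j = 0; j < n1; j++)
            f(i, j);
}

template <typename F>
void forInternal(std::size_t n0, std::size_t n1, F f)
{
    // default range [begin, end) = [1, sizes - 1)
    for (std::size_t i = 1; i + 1 < n0; i++)
        for (std::size_t j = 1; j + 1 < n1; j++)
            f(i, j);
}

template <typename F>
void forBoundary(std::size_t n0, std::size_t n1, F f)
{
    // the complement [0, sizes) \ [1, sizes - 1)
    for (std::size_t i = 0; i < n0; i++)
        for (std::size_t j = 0; j < n1; j++)
            if (i == 0 || j == 0 || i == n0 - 1 || j == n1 - 1)
                f(i, j);
}
```

For any sizes, `forInternal` and `forBoundary` together visit each index of `forAll` exactly once.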


- `reduce_along_axis`, `reduce_along_axes`: generalized multireductions; see also https://bitbucket.org/eigen/eigen/src/default/unsupported/Eigen/CXX11/src/Tensor/README.md?at=default&fileviewer=file-view-default#markdown-header-reduction-operations
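To illustrate the semantics: reducing along an axis contracts that index and yields an array of one dimension less. A sketch for a row-major 2D array summed along axis 0 (the function name follows the wiki entry, but the concrete signature is an assumption; the real operation would take a generic reduction functor and work for any axes of an N-D array):

```cpp
#include <cstddef>
#include <vector>

// Sum a (rows x cols) row-major array along axis 0:
// the row index is contracted, leaving one value per column.
inline std::vector<double>
sumAlongAxis0(const std::vector<double>& a, std::size_t rows, std::size_t cols)
{
    std::vector<double> result(cols, 0.0);
    for (std::size_t i = 0; i < rows; i++)
        for (std::size_t j = 0; j < cols; j++)
            result[j] += a[i * cols + j];
    return result;
}
```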
- `apply_along_axis`: apply a function to all 1D slices along a given axis (challenge: parallelization of the outer or inner function); note that unlike `numpy.apply_along_axis`, the inner function cannot change the array dimension/shape
    - note that the similar NumPy function `apply_over_axes` is not applicable to `NDArray`, because the slices along different axes have different types, so a single function cannot be applied to them; also, even in NumPy it is interesting only with a change of dimension/shape
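A 2D sketch of `apply_along_axis`: the inner function transforms each 1D slice in place, which is why it cannot change the dimension/shape, as noted above. Slices along the fastest axis of a row-major array are contiguous, so each one can be handed to `f` as a plain pointer; the name and signature are assumptions for illustration:

```cpp
#include <cstddef>
#include <vector>

// Apply f to every 1D slice along axis 1 (each row) of a
// row-major (rows x cols) array; f modifies the slice in place.
template <typename F>
void applyAlongAxis1(std::vector<double>& a,
                     std::size_t rows, std::size_t cols, F f)
{
    for (std::size_t i = 0; i < rows; i++)
        f(&a[i * cols], cols);   // one contiguous row
}
```

The parallelization challenge from the list is visible here: one can parallelize the outer loop over slices, or leave the outer loop sequential and parallelize inside `f`, but not trivially both.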

- reordering of N-D arrays along any axis (currently hard-coded in `tnl-mhfem` only for one specific layout of dofs)
- other NumPy array manipulation routines
- logical reshaping and transpose-like operations (i.e. return a view with different sizes or a permutation of axes, without changing the data)
- Eigen geometrical operations on tensors
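The transpose-like operations above need no data movement: a view just permutes the strides used for indexing, so `(i, j)` on the view addresses element `(j, i)` of the underlying buffer. A minimal 2D sketch; the class name and interface are illustrative only:

```cpp
#include <cstddef>

// Read-only transposed view of a row-major (rows x cols) buffer.
struct TransposedView2D
{
    const double* data;
    std::size_t rows, cols;   // sizes of the *original* array

    // element (i, j) of the transpose = element (j, i) of the original
    double operator()(std::size_t i, std::size_t j) const
    { return data[j * cols + i]; }
};
```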
## Benchmarks

- compilation time depending on the number of dimensions, the number of ndarrays in the code, ...
- overhead of the indexing calculation for a high-dimensional array
- operations
    - comparison with RAJA:
        - identity permutation, set: 1D bandwidth 9.4718 GB/s, 6D bandwidth 8.52481 GB/s (9% difference)
        - identity permutation, assign: 1D bandwidth 11.503 GB/s, 6D bandwidth 11.0063 GB/s (4.5% loss compared to 1D)
        - reverse permutation, assign: 6D bandwidth 9.58735 GB/s (13% loss compared to identity 6D)
    - comparison with OpenFOAM: `ScalarField`, `VectorField`, `TensorField`, operations like tensor*vector (locally on mesh cells)
## References
- AoS vs. SoA vs. SoAoS: some LBM paper
- chapters 21 and 31 in the GPU Computing Gems
- P0009: Polymorphic Multidimensional Array
- https://scicomp.stackexchange.com/questions/3130/how-to-efficiently-structure-simulation-data-in-memory-for-cells-with-varying-de