
NDArray

Jakub Klinkovský requested to merge ndarray into develop

Features

  • write tests
  • implement SlicedNDArray, generalize the slicing calculations from multimaps
  • implement static NDArray (avoid dynamic allocations when all dimensions are static)
  • implement low-dimensional accessors into the subspaces of the NDArray (see the subarray sketch after this list)
    • template arguments specify the dimensions contained in the subarray
    • method arguments specify the multiindex of the first subarray element (it cannot be in the interior of the array)
    • dimension contraction with ranges like [begin, end) is not supported
    • access-intent attributes? (see the Kokkos presentation; useful e.g. for texture-cache loads of const data, but probably useless on Pascal and newer GPUs)
  • move into TNL (C++14 is required for better constexpr functions and generic lambdas; nvcc bug: extended lambdas cannot be generic)
  • distributed version with MPI
    • only dynamic sizes can be distributed
    • only 1D/2D/3D distribution (i.e. only up to 3 dimensions can be distributed)
    • support overlaps (ghost cells) and their synchronization (see the halo-exchange sketch after this list)
    • support periodic overlaps
    • support different overlap size in different directions
    • a distributed NDArray cannot return a view to the local data - LocalArrayView does not make sense due to overlaps
    • iterators: forAll, forInternal, forBoundary, forLocalInternal, forLocalBoundary, forOverlaps (either all or just some directions)
    • support asynchronous synchronization of overlaps (custom CUDA streams or per-thread default streams will be necessary)
  • implement generic assignment operator (see the assignment sketch after this list)
    • support any value, device and index types
    • support any permutation
    • support copies to and from non-contiguous memory (e.g. subarrays)
  • add support for different allocators (cf. the Array implementation)
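
The subarray accessors reduce to a fixed-origin view combined with the usual row-major index arithmetic. The following is a minimal standalone sketch of that idea, hard-coded to a 2D view into a 3D array; NDArray3D and Subarray2D are illustrative names, not the proposed interface.

    #include <cstddef>
    #include <iostream>
    #include <vector>

    // Standalone sketch: a row-major 3D array and a 2D view into its (j, k)
    // subspace at a fixed i. In the real accessor, template arguments would
    // select the dimensions kept in the subarray and method arguments would
    // give the multiindex of the first subarray element.
    struct NDArray3D
    {
        std::size_t ni, nj, nk;
        std::vector< double > data;

        NDArray3D( std::size_t ni, std::size_t nj, std::size_t nk )
        : ni( ni ), nj( nj ), nk( nk ), data( ni * nj * nk ) {}

        double& operator()( std::size_t i, std::size_t j, std::size_t k )
        {
            return data[ ( i * nj + j ) * nk + k ];
        }
    };

    struct Subarray2D
    {
        NDArray3D& array;
        std::size_t i;  // multiindex of the first subarray element: (i, 0, 0)

        double& operator()( std::size_t j, std::size_t k )
        {
            return array( i, j, k );
        }
    };

    int main()
    {
        NDArray3D a( 2, 3, 4 );
        Subarray2D s{ a, 1 };  // the slice a(1, :, :)
        s( 2, 3 ) = 42;
        std::cout << a( 1, 2, 3 ) << std::endl;  // prints 42
    }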
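
For the overlap synchronization, the core operation is a halo exchange between neighbouring ranks. Below is a minimal standalone sketch for a 1D distribution with one ghost cell on each side; the periodic case falls out of the modular rank arithmetic. This is plain MPI for illustration, not the distributed NDArray interface.

    #include <mpi.h>
    #include <vector>

    int main( int argc, char* argv[] )
    {
        MPI_Init( &argc, &argv );
        int rank, nproc;
        MPI_Comm_rank( MPI_COMM_WORLD, &rank );
        MPI_Comm_size( MPI_COMM_WORLD, &nproc );

        // layout: [ left ghost | localSize owned cells | right ghost ]
        const int localSize = 4;
        std::vector< double > data( localSize + 2, double( rank ) );

        // periodic neighbours
        const int left = ( rank + nproc - 1 ) % nproc;
        const int right = ( rank + 1 ) % nproc;

        // send the first owned cell left, receive the right ghost from the right
        MPI_Sendrecv( &data[ 1 ], 1, MPI_DOUBLE, left, 0,
                      &data[ localSize + 1 ], 1, MPI_DOUBLE, right, 0,
                      MPI_COMM_WORLD, MPI_STATUS_IGNORE );
        // send the last owned cell right, receive the left ghost from the left
        MPI_Sendrecv( &data[ localSize ], 1, MPI_DOUBLE, right, 1,
                      &data[ 0 ], 1, MPI_DOUBLE, left, 1,
                      MPI_COMM_WORLD, MPI_STATUS_IGNORE );

        MPI_Finalize();
    }

The asynchronous variant would replace MPI_Sendrecv with MPI_Isend/MPI_Irecv so that computation on the interior can overlap the communication.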
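
The generic assignment operator cannot fall back on a flat memcpy: source and target may use different value types, permutations, or non-contiguous memory. A standalone sketch of the element-wise strategy, copying a row-major 2D buffer into a column-major one (host memory only, with the layout formulas written out by hand):

    #include <cstddef>
    #include <iostream>
    #include <vector>

    int main()
    {
        const std::size_t ni = 2, nj = 3;
        std::vector< int > rowMajor( ni * nj ), colMajor( ni * nj );

        for( std::size_t i = 0; i < ni; i++ )
            for( std::size_t j = 0; j < nj; j++ )
                rowMajor[ i * nj + j ] = int( 10 * i + j );

        // generic assignment: iterate over multiindices and let each side
        // apply its own layout formula; this works for any permutation
        // and for non-contiguous targets such as subarrays
        for( std::size_t i = 0; i < ni; i++ )
            for( std::size_t j = 0; j < nj; j++ )
                colMajor[ j * ni + i ] = rowMajor[ i * nj + j ];

        std::cout << colMajor[ 1 * ni + 0 ] << std::endl;  // element (0, 1): prints 1
    }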

Applications

  • storage of NumDwarf coefficients in tnl-mhfem (that was the main driving force for the development)
  • storage for VectorField in TNL
  • storage of distribution functions and macroscopic quantities in LBM
  • subarrays: writing 1D and 2D slices into VTK
  • SlicedNDArray might improve data locality in CWYGMRES operations:
    • prefetching cannot cross page boundaries (see meltdown.pdf)
    • data placement on NUMA systems
  • replace StaticMatrix in tnl-mhfem with an alias to a static 2-dimensional NDArray (sketched below)
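
Assuming the static NDArray takes its sizes as compile-time template parameters, the StaticMatrix replacement could be a thin two-dimensional alias. StaticNDArray below is a hypothetical stand-in for illustration, not the actual interface:

    #include <cstddef>

    // C++14 constexpr product of a parameter pack (no fold expressions yet)
    constexpr std::size_t product() { return 1; }
    template< typename... Ts >
    constexpr std::size_t product( std::size_t head, Ts... tail )
    {
        return head * product( tail... );
    }

    // hypothetical static N-dimensional array: all sizes are compile-time
    // constants, so no dynamic allocation is needed
    template< typename Value, std::size_t... sizes >
    struct StaticNDArray
    {
        Value data[ product( sizes... ) ];
    };

    // the intended replacement in tnl-mhfem
    template< typename Value, std::size_t rows, std::size_t columns >
    using StaticMatrix = StaticNDArray< Value, rows, columns >;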

Operations

  • generic traversers for scalar functions (assign, add, subtract, multiply, ...); see the traverser sketch after this list
    • forAll method: calls a lambda function f as f(i1, i2, ..., iN), where the indices are taken from the [0, sizes) multiindex range
    • forInternal method: calls a lambda function f as f(i1, i2, ..., iN), where the indices are taken from the [begin, end) multiindex range (by default, begin == 1 and end == sizes - 1)
    • forBoundary method: calls a lambda function f as f(i1, i2, ..., iN), where the indices are taken from the [0, sizes) - [begin, end) multiindex set (by default, begin == 1 and end == sizes - 1); the parallel version works only in 1D, 2D, and 3D
  • reduce_along_axis, reduce_along_axes - generalized multireductions (see the multireduction sketch after this list); see also https://bitbucket.org/eigen/eigen/src/default/unsupported/Eigen/CXX11/src/Tensor/README.md?at=default&fileviewer=file-view-default#markdown-header-reduction-operations
  • apply_along_axis - apply a function to all 1D slices along a given axis (challenge: parallelization of the outer or the inner function); see the sketch after this list
    • Note that unlike numpy.apply_along_axis, the inner function cannot change the array dimension/shape.
    • Note that the similar NumPy function, apply_over_axes, is not applicable to NDArray because slices along different axes have different types, so a single function cannot be applied to all of them. Moreover, even in NumPy it is interesting only when the dimension/shape changes.
  • reordering of ND arrays along any axis (currently hardcoded in tnl-mhfem only for one specific layout of dofs)
  • other NumPy array manipulation routines - logical reshaping and transpose-like operations (i.e. returning a view with different sizes or a permutation of axes without changing the data)
  • Eigen geometrical operations on tensors
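
To make the traverser semantics concrete, here is a minimal host-only sketch hard-coded to 2D; the real traversers would be variadic and dispatch to the configured device, so forAll and forInternal below are standalone stand-ins, not the proposed interface.

    #include <cstddef>
    #include <iostream>

    // calls f(i, j) for all multiindices in [0, sizes)
    template< typename Func >
    void forAll( std::size_t ni, std::size_t nj, Func f )
    {
        for( std::size_t i = 0; i < ni; i++ )
            for( std::size_t j = 0; j < nj; j++ )
                f( i, j );
    }

    // calls f(i, j) for all multiindices in [begin, end),
    // with the defaults begin == 1 and end == sizes - 1
    template< typename Func >
    void forInternal( std::size_t ni, std::size_t nj, Func f )
    {
        for( std::size_t i = 1; i < ni - 1; i++ )
            for( std::size_t j = 1; j < nj - 1; j++ )
                f( i, j );
    }

    int main()
    {
        int all = 0, internal = 0;
        forAll( 4, 5, [&]( std::size_t, std::size_t ) { all++; } );
        forInternal( 4, 5, [&]( std::size_t, std::size_t ) { internal++; } );
        std::cout << all << " " << internal << std::endl;  // prints "20 6"
    }

forBoundary would then visit exactly the set difference of the two ranges, which is why its parallel version is tied to the low-dimensional cases.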
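
A standalone sketch of the multireduction idea: reducing a row-major ni x nj array along axis 0 with + produces the vector of column sums. The name reduce_along_axis0 and its signature are illustrative only.

    #include <cstddef>
    #include <iostream>
    #include <vector>

    // reduce along axis 0: result[j] = sum over i of a(i, j)
    std::vector< double >
    reduce_along_axis0( const std::vector< double >& a, std::size_t ni, std::size_t nj )
    {
        std::vector< double > result( nj, 0.0 );
        for( std::size_t i = 0; i < ni; i++ )
            for( std::size_t j = 0; j < nj; j++ )
                result[ j ] += a[ i * nj + j ];
        return result;
    }

    int main()
    {
        // 2 x 3 array with rows (1, 2, 3) and (4, 5, 6); column sums are (5, 7, 9)
        std::vector< double > a = { 1, 2, 3, 4, 5, 6 };
        for( double x : reduce_along_axis0( a, 2, 3 ) )
            std::cout << x << " ";
        std::cout << std::endl;
    }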
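
And a standalone sketch of apply_along_axis over the contiguous 1D slices of axis 1; the inner function modifies each slice in place, so the shape cannot change, consistent with the note above. The name apply_along_axis1 and its signature are illustrative only.

    #include <cstddef>
    #include <iostream>
    #include <vector>

    // applies f to each 1D slice a(i, :) of a row-major ni x nj array;
    // the outer loop over slices is one natural place to parallelize
    template< typename Func >
    void apply_along_axis1( std::vector< double >& a,
                            std::size_t ni, std::size_t nj, Func f )
    {
        for( std::size_t i = 0; i < ni; i++ )
            f( a.data() + i * nj, nj );  // contiguous slice a(i, :)
    }

    int main()
    {
        std::vector< double > a = { 1, 2, 3, 4, 5, 6 };  // 2 x 3
        // normalize each row by its first element
        apply_along_axis1( a, 2, 3, []( double* slice, std::size_t n ) {
            // iterate backwards so that slice[0] is divided last
            for( std::size_t j = n; j-- > 0; )
                slice[ j ] /= slice[ 0 ];
        } );
        std::cout << a[ 3 ] << " " << a[ 4 ] << " " << a[ 5 ] << std::endl;  // 1 1.25 1.5
    }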

Benchmarks

  • compilation time depending on the number of dimensions, the number of NDArrays in the code, ...

  • overhead of the indexing calculation for high-dimensional arrays

  • operations

  • comparison with RAJA:

    identity perm, set:     1D bandwidth: 9.4718 GB/s,  6D bandwidth: 8.52481 GB/s  (9% loss compared to 1D)
    identity perm, assign:  1D bandwidth: 11.503 GB/s,  6D bandwidth: 11.0063 GB/s  (4.5% loss compared to 1D)
    reverse perm, assign:                               6D bandwidth: 9.58735 GB/s  (13% loss compared to identity 6D)
  • comparison with OpenFOAM - ScalarField, VectorField, TensorField, operations like tensor*vector (locally on mesh cells)
