@@ -102,19 +102,59 @@ Common operations may be classified as follows:
\end{itemize}
Additionally, complex algorithms may involve data exchange prior to performing the local operation (such as the input vector re-distribution in the distributed matrix--vector multiplication).
General description of such algorithms is out of scope of this work.
In the following subsection, we focus on data exchange in stencil computations, which are common in numerical modeling.
In the following subsection, we focus on data exchange in stencil computations, which are common in numerical analysis.
\subsection{Overlaps and data synchronization}
\inline{Introduce the concept of overlaps}
\inline{Describe the synchronizer}
Stencils in numerical analysis are geometric arrangements of nodes on a structured grid that express which data is accessed from a grid node of interest and used in the numerical computation.
The application of stencils on the data stored in a non-distributed multidimensional array is straightforward as all data is readily available in the memory of one rank.
However, in a distributed configuration, the multidimensional array needs to be extended with \emph{overlaps} to facilitate access to data owned by neighboring ranks, and \emph{data synchronization} to exchange data in the overlapping regions of the array when the stencil is applied iteratively.
\Cref{fig:ndarray decomposition overlaps} shows a typical two-dimensional decomposition of a two-dimensional array into 9 subarrays with overlaps highlighted using dotted lines.
The width of the overlaps in the TNL implementation is configurable separately for each dimension of the array.
I.e., the overlaps can be described by a multi-index \ic{overlaps} where each component is non-negative and determines the number of data elements along the corresponding axis shared between neighboring ranks.
Each rank includes the overlapping regions in its local array which is allocated larger, i.e., it spans the multidimensional interval $[\ic{localBegins}-\ic{overlaps}, \ic{localEnds}+\ic{overlaps})$.
Note that negative indices may be used to access the local data in the overlapping region below \ic{localBegins} and they are mapped properly to the one-dimensional storage index.
Also note that the array always includes the so called \emph{periodic overlaps} in the allocation, i.e., overlapping regions around the whole array.
This allows to easily implement the periodic boundary conditions in numerical schemes as it ensures that the stencil can be applied on all local nodes.
\caption{Typical two-dimensional decomposition of a two-dimensional array into 9 subarrays with overlaps between neighbors.}
\caption{Typical two-dimensional decomposition of a two-dimensional array into 9 subarrays with overlaps between neighbors (highlighted using dotted lines).}
\label{fig:ndarray decomposition overlaps}
\end{figure}
The data synchronization for the overlaps in distributed multidimensional arrays in TNL is implemented in the class \ic{TNL::Containers::DistributedNDArraySynchronizer}.
Before the algorithm can be started, each MPI rank must configure the synchronizer:
\begin{itemize}
\item
The rank IDs of the \emph{neighbors} relevant for the stencil must be set.
In the example shown in \cref{fig:ndarray decomposition overlaps}, rank 4 would set ranks 1, 3, 5, and 7 as its neighbors for the five-point stencil, and ranks 0, 1, 2, 3, 5, 6, 7, and 8 for the nine-point stencil.
\item
The \emph{directions} in which the data can be transferred must be set depending on the stencil.
In the simplest case, all values of the array can be transferred in all directions, but some applications (such as the lattice Boltzmann method described in \cref{sec:LBM}) may use separate arrays for different synchronization directions (e.g., left-to-right or right-to-left).
\item
The \emph{tags} for MPI messages can be set to avoid conflicts when multiple arrays distributed among the same ranks are synchronized at the same time.
\end{itemize}
The synchronization algorithm can be executed in several modes.
The asynchronous or non-blocking mode allows to interleave the synchronization with some other, unrelated work by suspending the execution when all non-blocking MPI requests have been created and deferring the remaining steps for later.
Assuming that the MPI implementation can proceed with the communication in the background, this approach can greatly improve the efficiency of the distributed algorithm.
The synchronization procedure consists of the following general steps:
\begin{enumerate}
\item Allocate all send and receive buffers.
\item Copy data from the local array to the send buffers.
\item Start MPI communications with all neighbors via \ic{MPI_Isend} and \ic{MPI_Irecv}.
\item Return a vector of MPI requests (the non-blocking procedure is suspended after this step).
\item Wait for all MPI requests to complete.
\item Copy data from the receive buffers to the local array.
\end{enumerate}
Some of the aforementioned steps are not always necessary.
For example, the allocated buffers can be reused when the synchronizer is used repeatedly on arrays of the same size.
In general, to avoid the number of MPI requests, the data to be sent must be copied into a contiguous buffer and the received data must be copied from a contiguous buffer into the local array.
In some special cases depending on the synchronization directions and the layout of the array, it can be ensured that all data to be sent or received is already stored in a contiguous block of the array, allowing the copies to be omitted by replacing the buffers with views into the local array itself.
These cases are difficult to detect automatically and must be configured manually by the user.
\section{Sparse matrices}
\label{sec:sparse matrices}
\inline{Supporting paper: "Segments: Data abstraction for parallel computations on sparse data"}