Commit 2e7d14ad authored by Jakub Klinkovský

revised section on multidimensional array based on recent LBM development

parent ba15ade4
@@ -19,6 +19,8 @@ In the following subsections, we describe the \emph{distributed} version of the
In the \emph{distributed} configuration, a global multidimensional array is decomposed into several subarrays and each MPI rank typically stores the data associated to one subarray.
Since a multidimensional array stores structured data, we will consider only \emph{structured conforming} decompositions, where the array is split by hyperplanes perpendicular to one of the axes.
\Cref{fig:ndarray decomposition} shows a typical two-dimensional decomposition of a two-dimensional array into 9 subarrays.
Note that the implementation also allows multiple blocks to be assigned to the same rank, which generalizes the requirement for conforming decompositions.
This feature is useful for optimizing the decomposition under various constraints, as described in the following chapters.

\begin{figure}[bt]
    \centering
@@ -84,27 +86,36 @@ Before the algorithm can be started, each MPI rank must configure the synchroniz
    \item
        The rank IDs of the \emph{neighbors} relevant for the stencil must be set.
        In the example shown in \cref{fig:ndarray decomposition overlaps}, rank 4 would set ranks 1, 3, 5, and 7 as its neighbors for the five-point stencil, and ranks 0, 1, 2, 3, 5, 6, 7, and 8 for the nine-point stencil.
        Note that for simplicity, the rank numbering in \cref{fig:ndarray decomposition overlaps} is structured, but the synchronizer supports an arbitrary unstructured numbering.
    \item
        The \emph{directions} in which the data can be transferred must be set depending on the stencil.
        In the simplest case, all values of the array can be transferred in all directions, but some applications (such as the lattice Boltzmann method described in \cref{sec:LBM}) may use separate arrays for different synchronization directions (e.g., left-to-right or right-to-left).
    \item
        The \emph{tags} for MPI messages can be set to avoid conflicts when multiple arrays distributed among the same ranks are synchronized at the same time.
\end{itemize}
The synchronization procedure consists of the steps summarized in \cref{alg:distributed ndarray synchronization}.
Note that some of the steps in the algorithm are not always necessary.
For example, the allocated buffers can be reused when the synchronizer is used repeatedly on arrays of the same size.

\begin{algorithm}[Distributed multidimensional array synchronization]
    \label{alg:distributed ndarray synchronization}
    \begin{algsteps}
        \item Allocate all send and receive buffers.
        \item Copy data from the local array to the send buffers.
        \item Start MPI communication with all neighbors via \ic{MPI_Isend} and \ic{MPI_Irecv}.
        \item Return a vector of MPI requests (the non-blocking procedure is suspended after this step).
        \item Wait for all MPI requests to complete.
        \item Copy data from the receive buffers to the local array.
    \end{algsteps}
\end{algorithm}

The synchronization algorithm can be executed in several modes.
The asynchronous (non-blocking) mode allows the synchronization to be interleaved with other, unrelated work: the execution is suspended once all non-blocking MPI requests have been created, and the remaining steps are deferred for later.
Assuming that the MPI implementation can proceed with the communication in the background, this approach can greatly improve the efficiency of the distributed algorithm.
When multiple arrays are to be synchronized at the same time among the ranks, it is also desirable to interleave the individual steps of \cref{alg:distributed ndarray synchronization} via \emph{pipelining} for all arrays.

In general, to avoid a large number of small MPI requests, the data to be sent must be copied into a contiguous buffer, and the received data must be copied from a contiguous buffer into the local array.
On the other hand, if the data to be sent or received is already stored in a contiguous block of the array, these copies are redundant and should be omitted.
Such cases are detected automatically by the \ic{DistributedNDArraySynchronizer} class, and the buffers are replaced with views into the local array itself.
Hence, users may configure the layout of the array to match the synchronization directions in their application and thereby improve the efficiency of the synchronization algorithm.