\item Parallel initialization for all $\vec x \in\overline{\hat\Omega}_j$, excluding the ghost nodes (step \ref{algitem:initialization} in \cref{alg:LBM}).
Wait until the operation started in step \ref{algitem:LBM-overlapping:interior start} is finished.
\end{algsteps}
Wait until the operations started in step \ref{algitem:LBM-overlapping:interior start} are finished for all subdomains.
\item Go to the next time step (step \ref{algitem:nextiter} in \cref{alg:LBM}).
\end{algsteps}
\end{algsteps}
\subsection{Optimization remarks}
\label{sec:lbm:optimization}
The most important optimization for distributed LBM computations has already been outlined in \cref{alg:distributed LBM with overlapping}: processing interior lattice sites must be overlapped with data synchronization on the interfaces between neighboring subdomains.
This technique is implemented by launching multiple CUDA kernels in different CUDA streams and using stream-level synchronization functions \cite{nvidia:cuda}.
A separate CUDA stream is used for each interface with a neighboring subdomain and for the interior lattice sites.
The streams for interfaces are configured with the highest priority so that the operations in these streams are executed as soon as possible.
The stream for the interior is configured with the lowest priority as it is used for the longest-running kernel, which is waited for at the end of the time step.
Following the processing of lattice sites at the interfaces, data synchronization among the neighboring ranks starts.
An object of the \ic{DistributedNDArraySynchronizer} class that was described in \cref{sec:distributed ndarray synchronization} is created for each subdomain and each of the $Q$ discrete distribution functions.
Each synchronizer object is responsible for synchronizing one subarray to and from the neighbors in the direction of the corresponding vector from the velocity set.
For example, when the D3Q27 velocity set is used and the lattice is distributed in a 1D manner such that each subdomain has at most two neighbors, one to the left and one to the right, 9 distribution functions (corresponding to the discrete velocities with $+1$ in the first component) are synchronized from left to right across each interface, and another 9 distribution functions (corresponding to the discrete velocities with $-1$ in the first component) are synchronized from right to left.
Note that in the case of a 2D or 3D decomposition, the communication pattern is more complicated, since the relevant distribution functions must also be synchronized across the edges and corners of the interfaces between subdomains.
The synchronizer objects are used in \cref{alg:distributed LBM with overlapping} in the steps \ref{algitem:LBM-overlapping:communication 1} and \ref{algitem:LBM-overlapping:communication 2}.
The synchronization procedure corresponds to \cref{alg:distributed ndarray synchronization} with \emph{pipelining} among all synchronizers as mentioned in \cref{sec:distributed ndarray synchronization}.
More formally, the pipelined synchronization as executed by one MPI rank is summarized in \cref{alg:pipelined ndarray synchronization}.
\begin{algorithm}[Pipelined synchronization of distributed multidimensional arrays]
\label{alg:pipelined ndarray synchronization}
\begin{algsteps}
\item\label{algitem:ndarray pipelining:step 1}
For all synchronizers of the current MPI rank:
\begin{itemize}
\item Allocate all send and receive buffers (relevant only in the first iteration).
\item Start copying data from the local array to the send buffers.
\end{itemize}
\item\label{algitem:ndarray pipelining:step 2}
For all synchronizers of the current MPI rank:
\begin{itemize}
\item Wait until all data is copied to the send buffers.
\item Start MPI communications with all neighbors via \ic{MPI_Isend} and \ic{MPI_Irecv}.
\end{itemize}
\item\label{algitem:ndarray pipelining:step 3}
For all synchronizers of the current MPI rank:
\begin{itemize}
\item Wait for all MPI communications with the neighbors to complete.
\item Start copying data from the receive buffers to the local array.
\end{itemize}
\item\label{algitem:ndarray pipelining:step 4}
For all synchronizers of the current MPI rank:
\begin{itemize}
\item Wait until all data is copied to the local array.
\end{itemize}
\end{algsteps}
\end{algorithm}
For each pair of communicating processes, the general buffered data exchange proceeds as follows:
\begin{itemize}
\item Sending process:
\begin{enumerate}
\item copy values from the main array into the send buffer,
\item send the values from the buffer.
\end{enumerate}
\item Receiving process:
\begin{enumerate}
\item receive the values into the receive buffer,
\item copy the values from the buffer into the main array.
\end{enumerate}
\end{itemize}
Each copy operation started in the steps \ref{algitem:ndarray pipelining:step 1} and \ref{algitem:ndarray pipelining:step 3} involves launching a CUDA kernel, which is placed into the same high-priority CUDA stream that was previously used for the processing of the corresponding lattice sites.
However, as mentioned in \cref{sec:distributed ndarray synchronization}, these copy operations can be avoided entirely when the transferred data is already stored in contiguous blocks of memory.
In general, this can be ensured only with a 1D distribution of the lattice (i.e., the processes are numbered linearly and each has at most two neighbors, one to the left and one to the right) and with an appropriate storage layout for the underlying four-dimensional $Q \times N_x \times N_y \times N_z$ array (see \cref{sec:ndarray}):
\begin{itemize}
\item
The $Q$-dimension is ordered first such that the three-dimensional $N_x \times N_y \times N_z$ subarray for each discrete distribution function is stored contiguously.
\item
The $x$-, $y$-, and $z$-dimensions are ordered depending on the dimension along which the lattice is decomposed.
If it is decomposed along the $x$-dimension, the array is ordered such that the $N_y \times N_z$ slice for each $x$-coordinate is stored contiguously.
Furthermore, within each slice, each 1D row of length $N_y$ along the $y$-dimension is stored contiguously.
\end{itemize}
Based on the layout of the four-dimensional array storing the values of discrete distribution functions, the CUDA thread block size for LBM computations is selected as $(1, B_y, B_z)$, where the parameters $B_y$ and $B_z$ are determined based on the lattice sizes $N_y$ and $N_z$ by \cref{alg:LBM:CUDA thread block size} to achieve coalesced memory accesses \cite{nvidia:cuda}.
The constraints are that $N_y$ must be a multiple of $B_y$, $N_z$ must be a multiple of $B_z$, and $B_y$ must be a multiple of 32 (i.e., the \emph{warp size} on contemporary Nvidia GPUs \cite{nvidia:cuda}).
The algorithm tries to maximize the number of threads in the block up to \ic{max_threads}, a limit that we set empirically as $\ic{max_threads}=256$ for a single-precision data type and $\ic{max_threads}=128$ for a double-precision data type.
The communication optimizations described above can be implemented with a so-called \emph{CUDA-aware} MPI library, which accepts pointers to GPU memory directly, and with CUDA streams for the management of asynchronous operations on the GPU.
\begin{algorithm}[CUDA thread block size selection for LBM]
\label{alg:LBM:CUDA thread block size}
\begin{algsteps}
\item Set $B_y :=32$, $B_z :=1$.
\item Set $M_y :=\floor{N_y /32}$.
\item If $32 M_y \ne N_y$, return $(B_y, B_z)$.
\item For $n_z :=1, 2, \ldots, N_z$:
\begin{algsteps}
\item If $N_z \bmod n_z \ne0$, go to the next iteration.
\item For $m :=1, 2, \ldots, M_y$:
\begin{algsteps}
\item Set $n_y :=32 m$.
\item If $N_y \bmod n_y \ne0$, go to the next iteration.
\item If $n_y n_z \le\ic{max_threads}$ and $n_y n_z > B_y B_z$, set $B_y := n_y$ and $B_z := n_z$.
\end{algsteps}
\end{algsteps}
\item Return $(B_y, B_z)$.
\end{algsteps}
\end{algorithm}
Note that a \emph{CUDA-aware} MPI library may additionally employ the GPUDirect technology to transfer data between GPUs without staging through host memory; however, this feature did not work on any system that we had access to.
The TNL project's documentation \cite{TNL:documentation} provides an overview of the basic usage of the data structure.
In the following subsections, we describe the \emph{distributed} version of the data structure, which is especially useful for the algorithms presented in \cref{chapter:numerical methods}.
\subsection{Distributed multidimensional array}
\label{sec:distributed ndarray}
In the \emph{distributed} configuration, a global multidimensional array is decomposed into several subarrays and each MPI rank typically stores the data associated with one subarray.
Since a multidimensional array stores structured data, we will consider only \emph{structured conforming} decompositions, where the array is split by hyperplanes perpendicular to one of the axes.
\Cref{fig:ndarray decomposition} shows a typical two-dimensional decomposition of a two-dimensional array into 9 subarrays.
A general description of such algorithms is out of the scope of this work.
In the following subsection, we focus on data exchange in stencil computations, which are common in numerical analysis.
\subsection{Overlaps and data synchronization}
\label{sec:distributed ndarray synchronization}
Stencils in numerical analysis are geometric arrangements of nodes on a structured grid that express which data is accessed from a grid node of interest and used in the numerical computation.
The application of stencils on the data stored in a non-distributed multidimensional array is straightforward as all data is readily available in the memory of one rank.
However, in a distributed configuration, the multidimensional array needs to be extended with \emph{overlaps} to facilitate access to the data owned by neighboring ranks, and a \emph{data synchronization} step is needed to exchange the data in the overlapping regions of the array when the stencil is applied iteratively.