The sets $\mathcal G_\Omega$ and $\mathcal G_{\overline\Omega}$ of interior and all grid nodes, respectively, are defined by specifying the indices $i,j,k$ for the points $\vec x_{i,j,k}$ as
\begin{subequations}\label{eq:lbm:grid}
\begin{alignat}{5}
\mathcal G_\Omega&= \{\vec x_{i,j,k}&\;\in\Omega&\mid&\; i &\in\{1, \ldots, N_x-2\}, &\; j &\in\{1, \ldots, N_y-2\}, &\; k &\in\{1, \ldots, N_z-2\}\}, \\
\mathcal G_{\overline\Omega}&= \{\vec x_{i,j,k}&&\mid&\; i &\in\{0, \ldots, N_x-1\}, &\; j &\in\{0, \ldots, N_y-1\}, &\; k &\in\{0, \ldots, N_z-1\}\},
\end{alignat}
\end{subequations}
where the integers $N_x, N_y, N_z$ determine the numbers of grid nodes along the $x$, $y$, and $z$ axes, respectively.
To avoid mixing different unit systems, we also define a non-dimensional lattice $\mathcal L_\Omega$ and extended lattice $\mathcal L_{\overline\Omega}$ as
\begin{subequations}\label{eq:lbm:lattice}
\begin{alignat}{5}
\mathcal L_\Omega&= \{\Delta x [i,j,k]^T &&\mid&\; i &\in\{1, \ldots, N_x-2\}, &\; j &\in\{1, \ldots, N_y-2\}, &\; k &\in\{1, \ldots, N_z-2\}\}, \\
\mathcal L_{\overline\Omega}&= \{\Delta x [i,j,k]^T &&\mid&\; i &\in\{0, \ldots, N_x-1\}, &\; j &\in\{0, \ldots, N_y-1\}, &\; k &\in\{0, \ldots, N_z-1\}\},
\end{alignat}
\end{subequations}
where $\Delta x$ is the lattice spacing parameter, which we set to $\Delta x =1$ as is usual in the LBM community~\cite{kruger:2017lattice}.
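To make the index conventions above concrete, the following Python sketch (with a hypothetical helper name) enumerates the interior and extended lattice sites for given sizes $N_x, N_y, N_z$; since $\Delta x = 1$, the lattice coordinates coincide with the integer indices.

```python
import itertools

def lattice_sites(N, interior=False):
    """Enumerate lattice site indices (i, j, k).

    interior=False: extended lattice, indices 0 .. N-1 in each direction.
    interior=True:  interior lattice, indices 1 .. N-2 in each direction.
    """
    if interior:
        ranges = [range(1, n - 1) for n in N]   # 1 .. N-2 inclusive
    else:
        ranges = [range(0, n) for n in N]       # 0 .. N-1 inclusive
    return set(itertools.product(*ranges))

N = (6, 5, 4)  # example sizes (N_x, N_y, N_z)
extended = lattice_sites(N)
interior = lattice_sites(N, interior=True)
print(len(extended), len(interior))
```

The interior lattice contains $(N_x-2)(N_y-2)(N_z-2)$ sites and the extended lattice $N_x N_y N_z$ sites, so the boundary sites are exactly the difference of the two sets.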
In this thesis, we aim to use LBM for the solution of the Navier--Stokes equations in $\mathbb R^3$ and thus consider the D3Q27 velocity set consisting of $Q =27$ discrete velocities.
A velocity set for LBM is fully defined by two sets of quantities: lattice velocities $\vec\xi_q$ and the corresponding weights $w_q$ for $q \in\{1, \ldots, Q\}$.
In all aforementioned velocity sets, all vectors $\vec\xi_q$ have components in $\{-c, 0, c\}$, where $c =\Delta x /\Delta t$ is the lattice speed.
Another important quantity that can be derived from these two sets is the lattice speed of sound $c_s$.
The aforementioned velocity sets can be derived using the Gauss--Hermite quadrature~\cite[Appendix~A.4]{kruger:2017lattice}.
Another approach explained in \cite[Section~3.4.7.2]{kruger:2017lattice} is to determine the weights by solving the system of equations representing general conditions that ensure rotational isotropy of the lattice.
The Python script in \cref{appendix:LBM:velocity sets} can be used to quickly verify the consistency of discrete velocities, weights and the speed of sound for a given velocity set, or even to obtain the necessary conditions when designing a new velocity set.
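As a quick illustration of such a consistency check (a minimal stand-in for the script in \cref{appendix:LBM:velocity sets}, not the script itself), the sketch below builds the D3Q27 velocities with their standard weights and verifies the moment conditions that yield $c_s^2 = 1/3$ in lattice units.

```python
import itertools
import numpy as np

# D3Q27: all 27 velocities with components in {-1, 0, 1} (lattice units, c = 1).
xi = np.array(list(itertools.product((-1, 0, 1), repeat=3)), dtype=float)
# The standard weights depend only on the number of nonzero components.
weight_by_nonzeros = {0: 8/27, 1: 2/27, 2: 1/54, 3: 1/216}
w = np.array([weight_by_nonzeros[int(np.sum(v != 0))] for v in xi])

cs2 = 1/3  # lattice speed of sound squared

# Moment conditions ensuring consistency and rotational isotropy:
m0 = w.sum()                                       # = 1
m1 = np.einsum('q,qa->a', w, xi)                   # = 0
m2 = np.einsum('q,qa,qb->ab', w, xi, xi)           # = cs2 * delta_ab
m3 = np.einsum('q,qa,qb,qc->abc', w, xi, xi, xi)   # = 0
m4 = np.einsum('q,qa,qb,qc,qd->abcd', w, xi, xi, xi, xi)
d = np.eye(3)
iso4 = cs2**2 * (np.einsum('ab,cd->abcd', d, d)
                 + np.einsum('ac,bd->abcd', d, d)
                 + np.einsum('ad,bc->abcd', d, d))  # fourth-order isotropy
```

If any of these conditions failed, the lattice would not recover the Navier--Stokes stress tensor correctly.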
Before discretization, \cref{eq:lbm:BTE} is first transformed to a non-dimensional form by scaling the physical quantities based on a characteristic length, time and density \cite{kruger:2017lattice}.
For the purpose of lattice computations, a system of \emph{lattice units} is defined by selecting $\delta_x$, $\delta_t$ and $\rho_0$ (i.e., density of an incompressible fluid) as the conversion factors.
In lattice units, we set the space step as $\Delta x =1$, the time step as $\Delta t =1$, and the average density $\hat\rho_0=1$.
Consequently, the lattice speed becomes $c =\Delta x /\Delta t =1$, so all microscopic velocities in the velocity set have components in $\{-1, 0, 1\}$.
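The conversion between physical units and lattice units can be sketched as follows; the conversion-factor values are purely illustrative, and a dimensionless number such as the Reynolds number must come out identical in both unit systems.

```python
# Illustrative conversion factors: grid spacing delta_x [m],
# time step delta_t [s], and reference density rho_0 [kg/m^3].
delta_x = 1.0e-3
delta_t = 2.0e-5
rho_0 = 1.0e3

def to_lattice(length=None, velocity=None, viscosity=None, density=None):
    """Convert physical quantities to non-dimensional lattice units."""
    out = {}
    if length is not None:
        out['length'] = length / delta_x
    if velocity is not None:
        out['velocity'] = velocity * delta_t / delta_x
    if viscosity is not None:
        out['viscosity'] = viscosity * delta_t / delta_x**2
    if density is not None:
        out['density'] = density / rho_0
    return out

# The Reynolds number is dimensionless, so it is invariant under the conversion.
u, L, nu = 0.5, 0.1, 1.0e-6          # physical values: m/s, m, m^2/s
lat = to_lattice(length=L, velocity=u, viscosity=nu)
Re_physical = u * L / nu
Re_lattice = lat['velocity'] * lat['length'] / lat['viscosity']
```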
The discretization of \cref{eq:lbm:BTE} in velocity space (using a velocity set), space (using a lattice) and time (using a discrete set of time levels) leads to the lattice Boltzmann equation
\begin{equation}\label{eq:lbm:LBE}
f_q(\vec x + \Delta t \vec\xi_q, t + \Delta t) = f_q(\vec x, t) + \mathcal C_q(\vec x, t),
\qquad\forall q \in\{1, \ldots, Q\},
\end{equation}
where $\mathcal C_q$ is the discrete collision operator.
The simplest initialization approach, which is used in this work, is to set the discrete density distribution functions at $t =0$ to their equilibria according to the supplied macroscopic density and velocity using
\begin{equation}
f_q(\vec x, 0) = f_q^{\mathrm{eq}}\bigl(\rho(\vec x, 0), \vec v(\vec x, 0)\bigr),
\qquad\forall q \in\{1, \ldots, Q\},
\end{equation}
for all lattice sites $\vec x \in\mathcal L_{\overline\Omega}$.
Unless noted otherwise, the values $\rho(\vec x, 0)=\hat\rho_0$ and $\vec v(\vec x, 0)=\vec0$ are used.
Different initialization schemes can be found in \cite{kruger:2017lattice}.
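To illustrate, the sketch below performs this equilibrium initialization for D3Q27, assuming the common second-order polynomial equilibrium $f_q^{\mathrm{eq}} = w_q \rho \bigl[1 + \tfrac{\vec\xi_q \cdot \vec v}{c_s^2} + \tfrac{(\vec\xi_q \cdot \vec v)^2}{2 c_s^4} - \tfrac{\vec v \cdot \vec v}{2 c_s^2}\bigr]$ (a standard choice; the thesis' exact notation may differ).

```python
import itertools
import numpy as np

# D3Q27 velocities and weights in lattice units (c = 1, c_s^2 = 1/3).
xi = np.array(list(itertools.product((-1, 0, 1), repeat=3)), dtype=float)
w = np.array([{0: 8/27, 1: 2/27, 2: 1/54, 3: 1/216}[int((v != 0).sum())]
              for v in xi])
cs2 = 1/3

def equilibrium(rho, v):
    """Second-order polynomial equilibrium distribution for one lattice site."""
    xv = xi @ v                      # xi_q . v for each direction q
    v2 = v @ v
    return w * rho * (1 + xv / cs2 + xv**2 / (2 * cs2**2) - v2 / (2 * cs2))

# Initialization with rho = 1 and v = 0 simply gives f_q = w_q.
f = equilibrium(1.0, np.zeros(3))

# The zeroth and first moments recover the prescribed macroscopic fields.
v0 = np.array([0.05, -0.02, 0.01])
f0 = equilibrium(1.0, v0)
```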
\subsection{Streaming schemes}
\begin{subequations}
\begin{enumerate}
\item The \emph{collision} step:
\begin{equation}
f_q^\ast(\vec x, t) = f_q(\vec x, t) + \mathcal C_q(\vec x, t),
\qquad\forall q \in\{1, \ldots, Q\},
\end{equation}
where the post-collision distribution function $f_q^\ast$ is computed locally at each lattice site.
\item The \emph{streaming} step:
\begin{equation}
f_q(\vec x + \Delta t \vec\xi_q, t + \Delta t) = f_q^\ast(\vec x, t),
\qquad\forall q \in\{1, \ldots, Q\},
\end{equation}
where the post-collision distribution function $f_q^\ast$ is propagated to the neighboring lattice site in the direction of the corresponding velocity $\vec\xi_q$.
\end{enumerate}
\end{subequations}
This scheme is called the \emph{push scheme}, since the streaming step propagates the values from the reference lattice site $\vec x \in\mathcal L_\Omega$ to its neighbors.
As these two steps alternate in a time loop, it essentially does not matter whether the time step is formulated to start with the collision step or the streaming step.
The alternative, yet equivalent scheme can be formulated as:
\begin{subequations}
\begin{enumerate}
\item The \emph{streaming} step:
\begin{equation}
f_q(\vec x, t) = f_q^\ast(\vec x - \Delta t \vec\xi_q, t - \Delta t),
\qquad\forall q \in\{1, \ldots, Q\},
\end{equation}
where the post-collision distribution function $f_q^\ast$ is propagated from the neighboring lattice site in the direction of $-\vec\xi_q$ from the previous time level.
\item The \emph{collision} step:
\begin{equation}
f_q^\ast(\vec x, t) = f_q(\vec x, t) + \mathcal C_q(\vec x, t),
\qquad\forall q \in\{1, \ldots, Q\}.
\end{equation}
\end{enumerate}
\end{subequations}
This scheme is called the \emph{pull scheme}, since the streaming step propagates the values to the reference lattice site $\vec x \in\mathcal L_\Omega$ from its neighbors.
The visualization of the difference between the push and pull streaming schemes is illustrated in \cref{fig:lbm:AB push,fig:lbm:AB pull} for a D2Q9 velocity set.
Note that regardless of the chosen streaming scheme, the collision step is \emph{local} in terms of the lattice sites $\vec x \in\mathcal L_\Omega$ (i.e., collisions in different lattice sites are independent of each other), but collective for all $q \in\{1, \ldots, Q\}$ since the collision operator $\mathcal C_q$ depends on the whole set of discrete density distribution functions in a given lattice site.
On the other hand, the operations in the streaming step are \emph{non-local}, but trivial as they do not require any computation (the post-collision values are merely transferred from one place to another).
Both the push and pull schemes require attention during the implementation to avoid overwriting memory locations containing values that may be needed in the future.
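The equivalence of the two schemes can be illustrated on a small periodic example (with the collision step omitted for brevity): the push scheme scatters each post-collision value to $\vec x + \Delta t\vec\xi_q$, while the pull scheme gathers it from $\vec x - \Delta t\vec\xi_q$, and both produce the same state.

```python
import numpy as np

# One-dimensional periodic lattice with velocities xi_q in {-1, 0, +1} (D1Q3).
N = 8
xi = (-1, 0, 1)
rng = np.random.default_rng(0)
f_star = rng.random((len(xi), N))  # post-collision values f*_q(x, t)

# Push scheme: scatter f*_q(x, t) to the neighbor at x + xi_q.
pushed = np.empty_like(f_star)
for q, c in enumerate(xi):
    for x in range(N):
        pushed[q, (x + c) % N] = f_star[q, x]

# Pull scheme: gather f*_q from the neighbor at x - xi_q.
pulled = np.empty_like(f_star)
for q, c in enumerate(xi):
    for x in range(N):
        pulled[q, x] = f_star[q, (x - c) % N]
```

The two loops differ only in whether the shifted index appears on the write side (push) or on the read side (pull), which matters for memory-access patterns but not for the result.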
\item Go to the next iteration ($t := t +\Delta t$).
\end{enumerate}
\end{subequations}
Note that for each lattice site $\vec x \in\mathcal L_\Omega$, the steps \cref{eq:lbm:AA-pattern:odd pre-collision step,eq:lbm:AA-pattern:odd post-collision step} access exactly the same memory locations in the array \ic{A}, since for the whole set of discrete density distribution functions, we can invert the direction $q$ in \cref{eq:lbm:AA-pattern:odd pre-collision step} to obtain $f_{\overline q}(\vec x, t)=\texttt{A[}q, \vec x +\Delta t \vec\xi_q\texttt{]}$, which uses the same array access indices as \cref{eq:lbm:AA-pattern:odd post-collision step}.
The operations in the A-A pattern streaming scheme in the even and odd iterations are visualized for a D2Q9 velocity set in \cref{fig:lbm:AA even,fig:lbm:AA odd}, respectively.
The A-A pattern reduces memory requirements for LBM, but removes the freedom of accessing neighboring values at the current time level separately from the streaming pattern.
This causes problems when implementing certain types of boundary conditions, such as those based on extrapolation.
The author is not aware of any literature dealing with the correct implementation of boundary conditions in combination with the A-A pattern.
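The memory accesses of the A-A pattern can be sketched on a one-dimensional periodic toy example with the collision step taken as the identity (a didactic model, not our implementation): after one even and one odd iteration, the single array \ic{A} holds the same state as two streaming steps performed with separate source and destination arrays.

```python
import numpy as np

# D1Q3 toy setup: velocities xi_q, with qbar(q) = 2 - q the opposite direction.
N = 8
xi = (-1, 0, 1)
rng = np.random.default_rng(1)
f0 = rng.random((3, N))  # f_q(x, t=0)

# Reference: two ordinary streaming steps f_q(x + xi_q, t + 1) = f_q(x, t).
ref = f0.copy()
for _ in range(2):
    ref = np.stack([np.roll(ref[q], c) for q, c in enumerate(xi)])

# A-A pattern with a single array A (collision is the identity here).
A = f0.copy()
# Even iteration: read f_q(x) = A[q, x], write to the reversed direction in place.
A_next = np.empty_like(A)
for q in range(3):
    for x in range(N):
        f_q = A[q, x]             # pre-collision read
        A_next[2 - q, x] = f_q    # post-collision write to A[qbar, x]
A = A_next
# Odd iteration: read f_q(x) = A[qbar, x - xi_q], write f*_q to A[q, x + xi_q];
# over the whole set of directions, reads and writes touch the same locations.
A_next = np.empty_like(A)
for q, c in enumerate(xi):
    for x in range(N):
        f_q = A[2 - q, (x - c) % N]   # pre-collision read
        A_next[q, (x + c) % N] = f_q  # post-collision write
A = A_next
# After the odd iteration, A[q, x] again stores f_q(x, t) directly.
```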
\begin{figure}[tb]
\centering
\end{algsteps}
\end{algorithm}
\Cref{alg:parallel LBM} is applicable to parallel architectures that utilize a single shared memory, such as multicore CPUs within a single node or a single GPU accelerator.
In the latter case, the algorithm needs to be extended with data copies to and from the device memory for the input and output of relevant macroscopic quantities, but these steps are omitted in \cref{alg:parallel LBM} for clarity.
In order to utilize multiple GPUs or even multiple nodes for computing in a distributed fashion, an additional level of coarse-grained parallelism has to be introduced based on the domain decomposition approach.
The global lattice $\mathcal L_{\overline\Omega}$ is split into subdomains $\mathcal L_{\overline\Omega, j}$, $j \in\{1, \ldots, N_{\mathrm{part}}\}$, which are processed by $N_{\mathrm{proc}}\le N_{\mathrm{part}}$ independent processes.
Each subdomain is assigned to one MPI rank that is mapped to one coarse-grained computational unit, such as a CPU core, a CPU socket, or a GPU accelerator.
Each MPI rank is responsible for the computations related to the one or more subdomains assigned to it.
In order to obtain consistent results, it must be ensured that the values of discrete density distribution functions are propagated correctly across the interfaces between subdomains.
However, due to the distributed memory system, this cannot happen automatically as part of the streaming steps in \cref{alg:parallel LBM}.
The solution adopted in our implementation is to extend each subdomain $\mathcal L_{\overline\Omega, j}$ with \emph{ghost lattice sites} which are excluded from parallel processing, but provide additional storage in the \ic{A} and \ic{B} arrays where other lattice sites will write and/or read values that will be later streamed to/from the neighboring subdomains.
After the parallel loop in each time step, parts of the \ic{A} (or \ic{B}) array corresponding to the ghost lattice sites are exchanged via MPI among neighboring subdomains.
This approach is summarized below in \cref{alg:distributed LBM}.
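The ghost-layer idea can be sketched without MPI by modeling two subdomains as padded arrays; for brevity, only one distribution function with velocity $+1$ on a periodic 1D lattice is shown, and the "exchange" is a plain copy standing in for the MPI communication.

```python
import numpy as np

# Global periodic 1D lattice; one distribution function with velocity +1.
Ng = 8
f_global = np.arange(Ng, dtype=float)

# Two subdomains of 4 sites each, padded with one ghost site on each side:
# layout [ghost_left, interior (4 sites), ghost_right].
blocks = [np.empty(6), np.empty(6)]
blocks[0][1:5] = f_global[0:4]
blocks[1][1:5] = f_global[4:8]

# "Communication": fill ghost sites from the neighboring subdomain
# (this copy stands in for the MPI exchange between ranks).
blocks[0][0] = blocks[1][4]  # left ghost of block 0 <- last interior of block 1
blocks[1][0] = blocks[0][4]  # left ghost of block 1 <- last interior of block 0
blocks[0][5] = blocks[1][1]  # right ghosts (unused for velocity +1, shown anyway)
blocks[1][5] = blocks[0][1]

# Local pull streaming f(x, t+1) = f(x - 1, t) on the interior sites only.
for b in blocks:
    b[1:5] = b[0:4].copy()

# Reassembling the interiors reproduces one global streaming step.
assembled = np.concatenate([blocks[0][1:5], blocks[1][1:5]])
reference = np.roll(f_global, 1)
```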
\begin{algorithm}[Distributed LBM]
For each subdomain $\mathcal L_{\overline\Omega, j}$:
\begin{itemize}[labelsep=1em, itemindent=-0.1em]
\item Parallel initialization for all $\vec x \in\mathcal L_{\overline\Omega, j}$, excluding the ghost lattice sites (step \ref{algitem:initialization} in \cref{alg:LBM}).
Copy distribution functions in the \ic{A} array on the interfaces between neighboring subdomains.
\item While $t \le t_{\max}$:
\begin{algsteps}
\item\label{algitem:distributed LBM:compute}
For each subdomain and for all $\vec x \in\mathcal L_{\overline\Omega, j}$ (excluding the ghost lattice sites) in parallel:
\begin{itemize}[labelsep=1em, itemindent=-0.5em]
\item Perform the steps \ref{algitem:forcing} to \ref{algitem:output} from \cref{alg:LBM}, but only for the given lattice site $\vec x$.
\end{itemize}
However, even with the aforementioned optimization, a naive implementation of \cref{alg:distributed LBM} does not scale with an increasing number of MPI ranks.
In order to improve the scaling of the distributed LBM algorithm, several optimizations are necessary.
First of all, it is necessary to \emph{overlap} computation with communication in order to hide the extra latency which was not present in \cref{alg:parallel LBM}.
This can be done by processing the boundary lattice sites of each subdomain separately from the interior lattice sites, copying the data between subdomains as soon as the computation on the boundary has finished, and processing the interior lattice sites independently while all MPI ranks exchange their boundary data.
This approach is summarized in \cref{alg:distributed LBM with overlapping} using \emph{asynchronous operations}.
\begin{algorithm}[Distributed LBM with overlapped computation and communication]
\label{alg:distributed LBM with overlapping}
\label{sec:lbm:implementation}
This subsection covers practical aspects related to the LBM implementation \todo{cite something -- papers or (internal) GitLab project} in our code base.
It focuses primarily on the data structures needed in LBM and on the templated object-oriented design of our LBM implementation.
Performance optimizations are described in \cref{sec:lbm:optimization}.
Several multidimensional arrays are needed to store relevant values in the LBM implementation:
\begin{itemize}
\emph{Macroscopic quantities} are stored in a 4-dimensional array with the size $N_m \times N_x \times N_y \times N_z$, where $N_m$ is the number of macroscopic quantities stored during the simulation.
In case of the Navier--Stokes equations, at least 4 quantities (density and three components of velocity) need to be stored, but additional user-defined quantities can be added.
\item
\emph{Lattice site tags} are stored in a 3-dimensional array with the size $N_x \times N_y \times N_z$ and the type of elements \ic{short int}.
The values in this array determine the \emph{type} of each lattice site and are used to identify different boundary conditions and the geometry of immersed bodies.
\end{itemize}
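The array shapes described above can be summarized in a small sketch, where numpy stands in for \ic{TNL::Containers::DistributedNDArray} and the lattice sizes are illustrative:

```python
import numpy as np

Q = 27                      # number of discrete velocities (D3Q27)
Nx, Ny, Nz = 64, 32, 16     # illustrative lattice sizes
Nm = 4                      # macroscopic quantities: density + 3 velocity components

# Discrete distribution functions: one value per direction and lattice site.
f = np.zeros((Q, Nx, Ny, Nz))
# Macroscopic quantities stored during the simulation.
macro = np.zeros((Nm, Nx, Ny, Nz))
# Lattice site tags ("short int" corresponds to a 16-bit integer type).
tags = np.zeros((Nx, Ny, Nz), dtype=np.int16)
```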
The data structure used for all of these arrays, \ic{TNL::Containers::DistributedNDArray}, has been described in \cref{sec:ndarray}.
Recall that the global array corresponding to the lattice $\mathcal L_{\overline\Omega}$ must be decomposed in a \emph{structured conforming} manner, but each rank can be assigned multiple subdomains and the rank numbering is arbitrary.
The decomposition of the aforementioned arrays in LBM must be consistent.
Hence, we collect all distributed data structures in a class called \ic{LBM_BLOCK} that contains all data associated with one subdomain: local subarrays, subdomain sizes and offsets, and the indices of neighboring blocks and the ranks that own them.
To facilitate computations on GPU accelerators, two objects for each array are needed to represent data in different memory spaces.
One object allocates the array data in the host memory where it can be processed by the CPU, and the other object allocates the array data in the device memory where it can be processed by the GPU accelerator.
The main part of the LBM (i.e., steps \ref{algitem:LBM-overlapping:boundaries start} and \ref{algitem:LBM-overlapping:interior start} in \cref{alg:distributed LBM with overlapping}) is computed on GPU accelerators, but CPU processing is used for initialization and output of the results.
During the computation of the LBM algorithm, relevant arrays must be copied from one memory space to another whenever processing is switched from CPU to GPU or vice versa.
Furthermore, an additional attribute represented by the \ic{LBM_DATA} class is added to \ic{LBM_BLOCK}; it contains the data that need to be passed to the CUDA kernel when the processing of the block on the GPU starts.
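As a rough, hypothetical sketch of what such a per-subdomain container holds (names modeled on, but not identical to, \ic{LBM_BLOCK}; the host/device pair is represented by two plain arrays):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Block:
    """Hypothetical per-subdomain container, loosely modeled on LBM_BLOCK."""
    offset: tuple      # global index of the first local lattice site
    size: tuple        # local subdomain sizes (nx, ny, nz), without ghosts
    neighbors: dict    # direction -> (neighboring block index, owning MPI rank)
    df_host: np.ndarray = None    # distribution functions in host memory
    df_device: np.ndarray = None  # mirror of df_host in device memory

    def allocate(self, Q=27, ghost=1):
        # One ghost layer on each side of every spatial dimension.
        shape = (Q,) + tuple(n + 2 * ghost for n in self.size)
        self.df_host = np.zeros(shape)
        self.df_device = np.zeros(shape)  # placeholder for a GPU allocation

b = Block(offset=(0, 0, 0), size=(32, 32, 32), neighbors={'x+': (1, 1)})
b.allocate()
```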
Note that in case of a 2D or 3D decomposition, the communication pattern is more complicated as it is necessary to also synchronize relevant distribution functions across edges and corners of the interfaces between subdomains.
The synchronizer objects are used in \cref{alg:distributed LBM with overlapping} in the steps \ref{algitem:LBM-overlapping:communication 1} and \ref{algitem:LBM-overlapping:communication 2}.
The synchronization procedure corresponds to \cref{alg:distributed ndarray synchronization} on \cpageref{alg:distributed ndarray synchronization} with \emph{pipelining} among all synchronizers as mentioned in \cref{sec:distributed ndarray synchronization}.
More formally, the pipelined synchronization as executed by one MPI rank is summarized in \cref{alg:pipelined ndarray synchronization}.
\begin{algorithm}[Pipelined synchronization of distributed multidimensional arrays]
\item\label{algitem:ndarray pipelining:step 3}
For all synchronizers of the current MPI rank:
\begin{itemize}
\item Wait for all MPI communications with the neighbors to complete.
\item Start copying data from the receive buffers to the local array.
\end{itemize}
\item\label{algitem:ndarray pipelining:step 4}
Furthermore, each 1D slice of length $N_y$ along the $y$-dimension is stored contiguously.
\end{itemize}
Based on the layout of the four-dimensional array storing the values of discrete distribution functions, the CUDA thread block size for LBM computations is selected as $(1, B_y, B_z)$, where the parameters $B_y$ and $B_z$ are determined by \cref{alg:LBM:CUDA thread block size} based on the lattice sizes $N_y$ and $N_z$ to achieve coalesced memory accesses \cite{nvidia:cuda}.
The constraints are that $N_y$ must be a multiple of $B_y$, $N_z$ must be a multiple of $B_z$, and $B_y$ must be a multiple of 32 (i.e., the \emph{warp size} on contemporary Nvidia GPUs \cite{nvidia:cuda}).
The algorithm tries to maximize the number of threads in the block up to \ic{max_threads}, a limit that we set empirically as $\ic{max_threads}=256$ for a single-precision data type and $\ic{max_threads}=128$ for a double-precision data type.
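One way to satisfy these constraints (a sketch; the actual selection in \cref{alg:LBM:CUDA thread block size} may use a different search order or tie-breaking) is a brute-force search over the admissible pairs:

```python
def select_block_size(Ny, Nz, max_threads=256, warp_size=32):
    """Pick (By, Bz) such that By divides Ny, Bz divides Nz, By is a multiple
    of the warp size, and By * Bz <= max_threads, maximizing By * Bz."""
    best = None
    for By in range(warp_size, Ny + 1, warp_size):
        if Ny % By != 0:
            continue
        for Bz in range(1, Nz + 1):
            if Nz % Bz != 0 or By * Bz > max_threads:
                continue
            if best is None or By * Bz > best[0] * best[1]:
                best = (By, Bz)
    if best is None:
        raise ValueError("N_y must be a multiple of the warp size")
    return best

# Example: single precision (max_threads = 256) on a 256 x 100 slice.
By, Bz = select_block_size(256, 100)
```

For lattice sizes that are powers of two, the search typically saturates the \ic{max_threads} budget exactly; otherwise it returns the largest admissible block.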