\subsubsection{Lattice and velocity set}
We consider the discretization of a cuboidal domain $\Omega\subset\mathbb R^D$ using an equidistant orthogonal grid represented by a finite set $\mathcal G_\Omega$ of grid nodes.
The extended grid $\mathcal G_{\overline\Omega}$ consists of the interior nodes from $\mathcal G_\Omega$ and an additional outer layer of nodes, which is added to facilitate the discretization of the domain boundary.
The Cartesian coordinates of grid nodes are defined as $\vec x_{i,j,k}=[i \delta_x + O_x, j \delta_x + O_y, k \delta_x + O_z]^T$, where $\delta_x$~[\si{\metre}] is the grid spacing parameter and $\vec O =[O_x, O_y, O_z]^T$~[\si{\metre}] is an offset vector that allows fitting the grid nodes to the physical coordinates of the domain.
The sets $\mathcal G_\Omega$ and $\mathcal G_{\overline\Omega}$ of interior and all grid nodes, respectively, are defined by specifying the indices $i,j,k$ for the points $\vec x_{i,j,k}$ as
\begin{subequations}\label{eq:lbm:grid}
\begin{alignat}{5}
\mathcal G_\Omega&= \{\vec x_{i,j,k}&\;\in\Omega&\mid&\; i &\in\{1, \ldots, N_x-1\}, &\; j &\in\{1, \ldots, N_y-1\}, &\; k &\in\{1, \ldots, N_z-1\}\}, \\
\mathcal G_{\overline\Omega}&= \{\vec x_{i,j,k}&&\mid&\; i &\in\{0, \ldots, N_x\}, &\; j &\in\{0, \ldots, N_y\}, &\; k &\in\{0, \ldots, N_z\}\},
\end{alignat}
\end{subequations}
where the integers $N_x, N_y, N_z$ determine the numbers of grid nodes along the $x$, $y$, and $z$ axes, respectively.
To avoid mixing different unit systems, we also define a non-dimensional lattice $\mathcal L_\Omega$ and extended lattice $\mathcal L_{\overline\Omega}$ as
\begin{subequations}\label{eq:lbm:lattice}
\begin{alignat}{5}
\mathcal L_\Omega&= \{\Delta x [i,j,k]^T &&\mid&\; i &\in\{1, \ldots, N_x-1\}, &\; j &\in\{1, \ldots, N_y-1\}, &\; k &\in\{1, \ldots, N_z-1\}\}, \\
\mathcal L_{\overline\Omega}&= \{\Delta x [i,j,k]^T &&\mid&\; i &\in\{0, \ldots, N_x\}, &\; j &\in\{0, \ldots, N_y\}, &\; k &\in\{0, \ldots, N_z\}\},
\end{alignat}
\end{subequations}
where $\Delta x$ is the lattice spacing parameter, which we set to $\Delta x = 1$, as is usual in the LBM community~\cite{kruger:2017lattice}.
Note that a grid node $\vec x_{i,j,k}$ represents the same point in space as the corresponding lattice site $\Delta x [i, j, k]^T$, but in different unit systems.
Hence, we will use the notation $\vec x \in\mathcal G_\Omega$ to refer to a point using its physical coordinates and $\vec x \in\mathcal L_\Omega$ to refer to its lattice coordinates.
Similarly to the space discretization, the time interval $\left[0, t_{\max}\right]$ is discretized by a finite set $\mathcal G_t =\left\{\ell\delta_t \mid\ell\in\{0, \ldots, N_t\}\right\}$, where $\delta_t = t_{\max}/ N_t$~[\si{\second}] is the time step in physical units and $N_t$ is the number of time steps.
The set of non-dimensional time levels is denoted as $\mathcal L_t =\left\{\ell\Delta t \mid\ell\in\{0, \ldots, N_t\}\right\}$, where $\Delta t =1$ is the non-dimensional time step.
The next step in the discretization of the Boltzmann transport equation \eqref{eq:lbm:BTE} is the selection of an appropriate discrete velocity set.
Velocity sets are typically named D$D$Q$Q$, where $D$ stands for the spatial dimension and $Q$ denotes the number of discrete velocities per lattice site.
The discretization of \cref{eq:lbm:BTE} in velocity space (using a velocity set) leads to the lattice Boltzmann equation
\begin{equation}\label{eq:lbm:LBE}
f_q(\vec x + \Delta t \vec\xi_q, t + \Delta t) - f_q(\vec x, t) = \Delta t \left( \mathcal C_q(\vec x, t) + \mathcal S_q(\vec x, t) \right)
\end{equation}
Note that \cref{eq:lbm:LBE} is formulated in non-dimensional lattice units, so both $\vec x$ and $\Delta t \vec\xi_q$ are dimensionless and the shifted argument $\vec x + \Delta t \vec\xi_q$ is consistent.
for all $q \in\{1, \ldots, Q\}$, $\vec x \in\mathcal L_\Omega$, and $t \in\mathcal L_t$.
In \cref{eq:lbm:LBE}, $f_q = f_q(\vec x, t)$ is the discrete density distribution function corresponding to the microscopic velocity $\vec\xi_q$, $\mathcal C_q$ is the discrete collision operator, and $\mathcal S_q$ is the discrete source term.
Note that the collision operator $\mathcal C_q$ depends on the whole set of $Q$ discrete density distribution functions (i.e., the $Q$ equations written in \eqref{eq:lbm:LBE} are not independent).
Similarly to the continuous case, moments of the discrete density distribution functions $f_q$ are of interest, as they allow us to recover the macroscopic quantities.
The scheme for the approximation of the Navier--Stokes equations is constructed such that the conserved moments correspond to the macroscopic density $\rho$, which can be obtained in lattice units in $\mathcal L_\Omega\times\mathcal L_t$ by
where the post-collision distribution function $f_q$ is propagated to the neighboring lattice site in the direction of the corresponding velocity $\vec\xi_q$.
\end{enumerate}
\end{subequations}
This scheme is called the \emph{push scheme}, since the streaming step propagates the values from the reference lattice site $\vec x \in\mathcal L_\Omega$ to its neighbors.
As these two steps alternate in a time loop, it essentially does not matter if the time step is formulated to start with the collision step or the streaming step.
The alternative, yet equivalent, scheme can be formulated as follows:
\begin{subequations}
where $f_q^\ast$ denotes the value of the distribution function \emph{after} collision in the given lattice site.
\end{enumerate}
\end{subequations}
This scheme is called the \emph{pull scheme}, since the streaming step propagates the values to the reference lattice site $\vec x \in\mathcal L_\Omega$ from its neighbors.
The difference between the push and pull streaming schemes is illustrated in \cref{fig:lbm:AB push,fig:lbm:AB pull} for a D2Q9 velocity set.
Note that regardless of the chosen streaming scheme, the collision step is \emph{local} in terms of the lattice sites $\vec x \in\mathcal L_\Omega$ (i.e., collisions in different lattice sites are independent of each other), but collective for all $q \in\{1, \ldots, Q\}$, since the collision operator $\mathcal C_q$ depends on the whole set of distribution functions at the given lattice site.
On the other hand, the operations in the streaming step are \emph{non-local}, but trivial as they do not require any computation (the post-collision values are merely transferred from one place to another).
Both the push and pull schemes require attention during the implementation to avoid overwriting memory locations containing values that may be needed in the future.
The time-stepping procedure using the A-A pattern can be formalized as follows:
\item Go to the next iteration ($t := t +\Delta t$).
\end{enumerate}
\end{subequations}
Note that for each lattice site $\vec x \in\mathcal L_\Omega$, the steps \cref{eq:lbm:AA-pattern:odd pre-collision step,eq:lbm:AA-pattern:odd post-collision step} access exactly the same memory locations in the array \ic{A}, since for the whole set of density distribution functions, we can invert the direction $q$ in \cref{eq:lbm:AA-pattern:odd pre-collision step} to obtain $f_{\overline q}(\vec x, t)=\texttt{A[}q, \vec x +\Delta t \vec\xi_q\texttt{]}$, which uses the same array access indices as \cref{eq:lbm:AA-pattern:odd post-collision step}.
The operations in the A-A pattern streaming scheme in the even and odd iterations are visualized for a D2Q9 velocity set in \cref{fig:lbm:AA even,fig:lbm:AA odd}, respectively.
The time-stepping algorithm for LBM can be generalized to cover all aforementioned streaming patterns:
\label{alg:LBM}
\begin{algsteps}
\item\label{algitem:initialization}
Initialization: for all $\vec x \in\mathcal L_{\overline\Omega}$ and $q \in\{1, \ldots, Q\}$, compute $f_q(\vec x, 0)$ using \eqref{eq:lbm:initialization} and write the results to the \ic{A} array.
\item Set $t =\Delta t$.
\item While $t \le t_{\max}$:
\begin{algsteps}
\item\label{algitem:forcing}
Compute the external force density $\hat{\vec F}(\vec x, t)$ for all $\vec x \in\mathcal L_\Omega$.
\item\label{algitem:pre-collision streaming}
Pre-collision step according to the streaming pattern -- for all $\vec x \in\mathcal L_\Omega$ and $q \in\{1, \ldots, Q\}$:
\begin{itemize}[labelsep=1em, itemindent=-0.5em]
\item A-B push scheme: read $f_q(\vec x, t)$ from its natural location in the \ic{A} array.
\item A-B pull scheme: read $f_q(\vec x, t)$ from the \ic{A} array according to \cref{eq:lbm:pull-scheme:streaming step}.
\item A-A pattern: read $f_q(\vec x, t)$ from the \ic{A} array according to \cref{eq:lbm:AA-pattern:even pre-collision step} or~\cref{eq:lbm:AA-pattern:odd pre-collision step}.
\end{itemize}
\item\label{algitem:moments}
For all $\vec x \in\mathcal L_\Omega$, compute the macroscopic density $\rho(\vec x, t)$ and velocity $\vec v(\vec x, t)$ using \cref{eq:lbm:rho,eq:lbm:momentum}.
\item\label{algitem:boundary}
Handle boundary conditions for all relevant lattice sites in $\mathcal L_{\overline\Omega}$.
\item\label{algitem:collision}
Collision step: compute $f_q^\ast(\vec x, t)$ for all $\vec x \in\mathcal L_\Omega$ and $q \in\{1, \ldots, Q\}$ using \cref{eq:lbm:push-scheme:collision step} (this step is the same in all streaming patterns).
\item\label{algitem:post-collision streaming}
Post-collision step according to the streaming pattern -- for all $\vec x \in\mathcal L_\Omega$ and $q \in\{1, \ldots, Q\}$:
\begin{itemize}[labelsep=1em, itemindent=-0.5em]
\item A-B push scheme: write $f_q^\ast(\vec x, t)$ to the \ic{B} array according to \cref{eq:lbm:push-scheme:streaming step}.
\item A-B pull scheme: write $f_q^\ast(\vec x, t)$ to its natural location in the \ic{B} array.
\label{alg:parallel LBM}
\begin{algsteps}
\item\label{algitem:parallel initialization}
Initialization: perform the step \ref{algitem:initialization} in \cref{alg:LBM}, but in parallel for all $\vec x \in\mathcal L_{\overline\Omega}$.
\item Set $t =\Delta t$.
\item While $t \le t_{\max}$:
\begin{algsteps}
\item\label{algitem:parallel LBM:compute}
For all $\vec x \in\mathcal L_{\overline\Omega}$ in parallel:
\begin{itemize}[labelsep=1em, itemindent=-0.5em]
\item Perform the steps \ref{algitem:forcing} to \ref{algitem:output} from \cref{alg:LBM}, but only for the given lattice site $\vec x$.
\end{itemize}
In the latter case, the algorithm needs to be extended with data copies to and from the device memory due to input and output of relevant macroscopic quantities, but these steps are omitted in \cref{alg:parallel LBM} for clarity.
In order to utilize multiple GPUs or even multiple nodes for computing in a distributed fashion, an additional level of coarse-grained parallelism has to be introduced based on the domain decomposition approach.
The global lattice $\mathcal L_{\overline\Omega}$ is split into subdomains $\mathcal L_{\overline\Omega, j}$, $j \in\{1, \ldots, N_{\mathrm{proc}}\}$, which are processed independently using $N_{\mathrm{proc}}$ processes.
Each subdomain is assigned to one MPI rank that is mapped to one coarse-grain computational unit, such as a CPU core, a CPU socket, or a GPU accelerator, and is responsible for processing computations related to this subdomain.
In order to obtain consistent results, it must be ensured that the values of the discrete density distribution functions are propagated correctly across the interfaces between subdomains.
However, due to the distributed memory system, this cannot happen automatically as part of the streaming steps in \cref{alg:parallel LBM}.
The solution adopted in our implementation is to extend each subdomain $\mathcal L_{\overline\Omega, j}$ with \emph{ghost nodes}, which are excluded from parallel processing, but provide additional storage in the \ic{A} and \ic{B} arrays where other lattice sites write and/or read values that are later streamed to/from the neighboring subdomains.
After the parallel loop in each time step, parts of the \ic{A} (or \ic{B}) array corresponding to the ghost nodes are exchanged via MPI among neighboring subdomains.
This approach is summarized below in \cref{alg:distributed LBM}.
For each subdomain $\mathcal L_{\overline\Omega, j}$:
\begin{itemize}[labelsep=1em, itemindent=-0.1em]
\item Parallel initialization for all $\vec x \in\mathcal L_{\overline\Omega, j}$, excluding the ghost nodes (step \ref{algitem:initialization} in \cref{alg:LBM}).
Start processing \emph{lattice sites next to the interfaces with other subdomains}.
The values in this array determine the \emph{type} of each lattice site; they are used to identify different boundary conditions and the geometry of immersed bodies.
\end{itemize}
The data structure used for all of these arrays, \ic{TNL::Containers::DistributedNDArray}, has been described in \cref{sec:ndarray}.
Recall that the global array corresponding to the lattice $\mathcal L_{\overline\Omega}$ must be decomposed in a \emph{structured conforming} manner, but each rank can be assigned multiple subdomains and the rank numbering is arbitrary.
The decomposition of the aforementioned arrays in LBM must be consistent.
Hence, we collect all distributed data structures in a class called \ic{LBM_BLOCK} that contains all data associated with one subdomain: local subarrays, subdomain sizes and offsets, and the indices of neighboring blocks and of the ranks that own them.