This approach is summarized in \cref{alg:distributed LBM with overlapping}.
\end{algorithm}
Steps \ref{algitem:LBM-overlapping:boundaries start} and \ref{algitem:LBM-overlapping:interior start} in \cref{alg:distributed LBM with overlapping} involve the same local operations as step \ref{algitem:parallel LBM:compute} in \cref{alg:parallel LBM}, but on different sets of lattice sites.
Likewise, steps \ref{algitem:distributed LBM:communication 1} and \ref{algitem:distributed LBM:communication 2} in \cref{alg:distributed LBM}, as well as steps \ref{algitem:LBM-overlapping:communication 1} and \ref{algitem:LBM-overlapping:communication 2} in \cref{alg:distributed LBM with overlapping}, perform exactly the same operations, which are explained further in \cref{sec:lbm:optimization}.
\subsection{Implementation remarks}
\label{sec:lbm:implementation}
This subsection covers practical aspects of the LBM implementation in our code base.
The data structures needed in LBM are discussed and the templated object-oriented design of our LBM implementation is described.
Performance optimizations will be described in \cref{sec:lbm:optimization}.
Several multidimensional arrays are needed to store relevant values in the LBM implementation:
\begin{itemize}
\item
\emph{Distribution functions} are stored in a 4-dimensional array with the size $Q \times N_x \times N_y \times N_z$.
Depending on the streaming scheme, either one (\ic{A}) or two (\ic{A} and \ic{B}) such arrays are necessary.
\item
\emph{Macroscopic quantities} are stored in a 4-dimensional array with the size $N_m \times N_x \times N_y \times N_z$, where $N_m$ is the number of macroscopic quantities stored during the simulation.
In the case of the Navier--Stokes equations, at least 4 quantities (the density and three components of the velocity) need to be stored, but additional user-defined quantities can be added.
\item
\emph{Lattice site tags} are stored in a 3-dimensional array with the size $N_x \times N_y \times N_z$ and elements of type \ic{short int}.
The values in this array determine the \emph{type} of each lattice site and are used to identify the different boundary conditions and the geometry of immersed bodies.
\end{itemize}
The data structure used for all of these arrays, \ic{TNL::Containers::DistributedNDArray}, has been described in \cref{sec:ndarray}.
Recall that the global array corresponding to the lattice $\hat\Omega$ must be decomposed in a \emph{structured conforming} manner, but each rank can be assigned multiple subdomains and the rank numbering is arbitrary.
The decomposition of the aforementioned arrays in LBM must be consistent.
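For illustration, the row-major linearization performed internally by such a 4-dimensional array can be sketched in plain C++. The class and member names below are hypothetical; the actual code relies on \ic{TNL::Containers::DistributedNDArray} for this, as noted above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical row-major linearization of the 4D array of distribution
// functions with size Q x Nx x Ny x Nz; a simplified stand-in for what
// TNL::Containers::NDArray handles internally.
struct DFArraySketch
{
    std::size_t Q, Nx, Ny, Nz;
    std::vector< double > data;

    DFArraySketch( std::size_t Q, std::size_t Nx, std::size_t Ny, std::size_t Nz )
    : Q( Q ), Nx( Nx ), Ny( Ny ), Nz( Nz ), data( Q * Nx * Ny * Nz ) {}

    // flat index of the q-th distribution function at lattice site (x, y, z)
    std::size_t index( std::size_t q, std::size_t x, std::size_t y, std::size_t z ) const
    {
        return ( ( q * Nx + x ) * Ny + y ) * Nz + z;
    }

    double& operator()( std::size_t q, std::size_t x, std::size_t y, std::size_t z )
    {
        return data[ index( q, x, y, z ) ];
    }
};
```

The arrays for the macroscopic quantities and the lattice site tags follow the same pattern, with $Q$ replaced by $N_m$ or dropped entirely.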
Hence, we collect all distributed data structures in a class called \ic{LBM_BLOCK} that contains all data associated with one subdomain: local subarrays, subdomain sizes and offsets, and the indices of neighboring blocks together with the ranks that own them.
To facilitate computations on GPU accelerators, two objects in different memory spaces are needed for each array.
One object allocates the array data in the host memory where it can be processed sequentially by the CPU, and the other object allocates the array data in the device memory where it can be processed in parallel by the GPU accelerator.
The main part of the LBM (i.e., steps \ref{algitem:LBM-overlapping:boundaries start} and \ref{algitem:LBM-overlapping:interior start} in \cref{alg:distributed LBM with overlapping}) is computed on GPU accelerators, but CPU processing is used for initialization and output of the results.
During the computation of the LBM algorithm, relevant arrays must be copied from one memory space to another whenever processing is switched from CPU to GPU or vice versa.
Furthermore, an additional attribute, represented by the \ic{LBM_DATA} class, is added to \ic{LBM_BLOCK}; it contains the data that need to be passed to the CUDA kernel when the processing of the block on the GPU starts.
This includes pointers to the arrays allocated in the device memory, sizes of the lattice, and scalar parameters (e.g., viscosity and parameters related to boundary conditions).
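A strongly simplified sketch of this design may look as follows. The member names and types are illustrative only, not the actual interface of \ic{LBM_BLOCK} and \ic{LBM_DATA}; in particular, the device-memory arrays are represented here by host vectors.

```cpp
#include <cassert>
#include <vector>

// Illustrative stand-in for LBM_DATA: plain data passed to the CUDA kernel.
struct LBM_DATA_sketch
{
    // raw pointers handed to the kernel (device pointers in the real code)
    double* dfs = nullptr;       // distribution functions
    double* macro = nullptr;     // macroscopic quantities
    short int* tags = nullptr;   // lattice site tags
    int Nx = 0, Ny = 0, Nz = 0;  // local lattice sizes
    double viscosity = 0;        // example scalar simulation parameter
};

// Illustrative stand-in for LBM_BLOCK: all data of one subdomain.
struct LBM_BLOCK_sketch
{
    int offset[ 3 ] = { 0, 0, 0 };     // global offsets of the subdomain
    int size[ 3 ] = { 0, 0, 0 };       // local subdomain sizes
    std::vector< int > neighborRanks;  // ranks owning the neighboring blocks

    std::vector< double > hostDFs;     // host copy (initialization, output)
    std::vector< double > deviceDFs;   // stands in for the device copy

    LBM_DATA_sketch data;              // filled before each kernel launch
};
```

The two copies of each array reflect the host/device duality described above: the host copy is used for sequential CPU processing, while the device copy is processed in parallel by the GPU.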
The \ic{LBM_BLOCK} objects representing individual subdomains managed by the current process (MPI rank) are aggregated in the class named \ic{LBM}.
The \ic{LBM} class provides member functions to manipulate all managed blocks collectively, as well as additional functionality related to LBM simulations, such as variables for time-stepping or the conversion between physical and lattice units.
From a high level point of view, each MPI rank creates one object of the \ic{State} class, which combines all components needed to run a simulation.
It contains one object of the \ic{LBM} class to represent one set of discrete distribution functions for a Navier--Stokes problem.
The \ic{State} class is designed to be extended via inheritance, so users can define their own subclass for a particular case and adjust the behavior of the solver by adding attributes and overriding various virtual member functions.
Hence, it is possible to implement more complicated solvers for coupled problems, such as the Navier--Stokes--Fourier problem of fluid flow coupled with thermal transport \cite{Sharma2020}, which can be solved using multiple sets of discrete distribution functions represented by objects of the \ic{LBM} class.
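The extension mechanism can be sketched as follows. All names and virtual hooks are hypothetical simplifications of the design described above, not the actual interface of the \ic{State} and \ic{LBM} classes.

```cpp
#include <cassert>
#include <string>

// Illustrative stand-in for the LBM class: one set of distribution functions.
struct LBM_sketch
{
    double physDl = 1.0;  // conversion between physical and lattice units
    int iteration = 0;    // time-stepping variable
};

// Illustrative stand-in for the State class, extensible via inheritance.
struct State_sketch
{
    LBM_sketch nse;  // distribution functions for the Navier--Stokes problem

    virtual ~State_sketch() = default;

    // hypothetical hooks that a user-defined subclass may override
    virtual void setupBoundaries() {}
    virtual std::string name() const { return "base"; }
};

// A user-defined subclass for a coupled problem, e.g. the
// Navier--Stokes--Fourier problem with thermal transport: it simply adds
// a second set of distribution functions and overrides a virtual hook.
struct ThermalState_sketch : State_sketch
{
    LBM_sketch ade;  // second set of distribution functions (temperature)

    std::string name() const override { return "thermal"; }
};
```

Since the \ic{State} object lives on the host, ordinary virtual functions can be used here without the restrictions that apply to device code.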
Up to now, the description of the implementation revolved around data structures and classes, leading to the \ic{State} class that provides extensibility via subclassing.
This approach is suitable for high-level adaptations that reuse the existing low-level building blocks.
However, it is often necessary to modify even the low-level details, such as local per-lattice-site computations in the LBM algorithm.
For example, it may be desirable to switch to a different collision operator or to define additional macroscopic quantities.
In order to sustain high performance on GPU accelerators, which have limited support for virtual functions \cite{nvidia:cuda}, compile-time techniques are more appropriate than dynamic polymorphism.
Therefore, our LBM implementation is based on several template meta-programming components that allow the low-level details to be configured while preserving the generation of optimal CUDA kernels for each configuration.
The starting element of template meta-programming used in the code is the \ic{LBM_CONFIG} template class that represents a particular configuration for LBM.
It combines the following components that are set via template parameters:
\begin{itemize}
\item
\ic{TRAITS} is a structure that holds the definition of fundamental types used in the solver, such as the floating-point type, index type, permutations of the multidimensional arrays, etc.
\item
\ic{KERNEL_STRUCT} is a structure that is used in the CUDA kernel to store local data related to the given lattice site.
In particular, it contains an array of $Q$ discrete distribution functions that are read and written during the streaming steps.
\item
\ic{DATA} is a structure that refers to the \ic{LBM_DATA} class, or its user-defined subclass.
Its purpose is to pass data from the host to the CUDA kernel.
Besides the attributes contained in \ic{LBM_DATA}, it may be used to provide data for user-defined extensions of the following components.
\item
\ic{COLL} is a structure that implements a particular collision operator.
The action of the operator is computed in its static member function \ic{collision} using the values in an instance of \ic{KERNEL_STRUCT}.
\item
\ic{EQ} is a structure that implements a particular discrete equilibrium function.
\item
\ic{STREAMING} is a structure that implements a particular streaming scheme.
\item
\ic{BC} is a structure that defines tags that may be assigned to lattice sites and implements actions to be taken in the CUDA kernel based on these tags.
It is intended for user modification in order to specify boundary conditions for a particular simulation.
\item
\ic{MACRO} is a structure that implements the computation of macroscopic quantities.
%\item \ic{CPU_MACRO}
\end{itemize}
Only the \ic{KERNEL_STRUCT} and \ic{DATA} components are intended to represent data; the other components are never instantiated and serve merely to define types or algorithms in terms of static member functions.
The code base provides multiple models for each component with sensible default behavior.
Users can subclass each model, implement their modifications, and pass the resulting type to \ic{LBM_CONFIG}.
All the aforementioned classes \ic{LBM_BLOCK}, \ic{LBM}, and \ic{State} are in fact templates with one parameter, \ic{LBM_CONFIG}.
The modifications of the components such as \ic{BC} or \ic{MACRO} may need to be designed together with the extensions of \ic{State}, especially when custom attributes are added to the \ic{DATA} component.
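The compile-time composition can be sketched as follows. The structure mirrors the component list above, but all names, member functions, and the toy relaxation step are illustrative simplifications, not the actual \ic{LBM_CONFIG} interface.

```cpp
#include <array>
#include <cassert>

// Illustrative TRAITS component: fundamental types and constants.
struct TRAITS_sketch
{
    using RealType = double;
    using IndexType = int;
    static constexpr int Q = 27;  // e.g. the D3Q27 velocity model
};

// Illustrative KERNEL_STRUCT component: per-lattice-site data in the kernel.
template< typename TRAITS >
struct KERNEL_STRUCT_sketch
{
    std::array< typename TRAITS::RealType, TRAITS::Q > f;  // local DFs
};

// Illustrative COLL component: never instantiated, only a static member
// function. The "collision" here is a made-up relaxation towards a
// constant, not a physically meaningful operator.
template< typename TRAITS >
struct COLL_sketch
{
    template< typename KS >
    static void collision( KS& ks, typename TRAITS::RealType omega )
    {
        for( auto& fi : ks.f )
            fi += omega * ( 1.0 / TRAITS::Q - fi );
    }
};

// Illustrative LBM_CONFIG: combines the components via template parameters,
// so the compiler can inline everything into one optimized kernel.
template< typename TRAITS, template< typename > class COLL >
struct LBM_CONFIG_sketch
{
    using KernelStruct = KERNEL_STRUCT_sketch< TRAITS >;

    template< typename KS >
    static void collide( KS& ks, typename TRAITS::RealType omega )
    {
        COLL< TRAITS >::collision( ks, omega );
    }
};
```

Because the components are resolved at compile time, swapping a collision operator amounts to passing a different type to \ic{LBM_CONFIG_sketch}, with no run-time dispatch in the kernel.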
\subsection{Optimization remarks}
\label{sec:lbm:optimization}
This subsection describes performance optimizations used in our implementation: the permutation of the indices in the multidimensional arrays, and optimizations of the MPI communication, namely CUDA streams for overlapping computation and communication, the \ic{NDArraySynchronizer} class, avoiding buffering for contiguous data ranges, pipelining, and dedicated CUDA streams for the copy kernels when buffering is needed.
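The effect of the index permutation can be illustrated with a minimal sketch. The functions and sizes below are hypothetical; the point is only that placing the $Q$ index outermost makes consecutive lattice sites of the same discrete direction contiguous in memory, which enables coalesced accesses by CUDA threads.

```cpp
#include <cassert>
#include <cstddef>

// Illustrative sizes: N stands for the flattened Nx*Ny*Nz site count.
constexpr std::size_t Q = 27, N = 64;

// Permutation 1 ("structure of arrays"): q outermost, so sites with the
// same q are adjacent in memory -- consecutive CUDA threads reading
// consecutive sites access a contiguous range.
constexpr std::size_t idx_soa( std::size_t q, std::size_t site )
{
    return q * N + site;
}

// Permutation 2 ("array of structures"): q innermost, so consecutive
// sites are Q elements apart -- a strided, uncoalesced access pattern.
constexpr std::size_t idx_aos( std::size_t q, std::size_t site )
{
    return site * Q + q;
}
```

In our implementation the permutation is a compile-time property of the arrays (selected via the \ic{TRAITS} component), so switching between layouts does not require changes to the kernels themselves.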
@article{Sharma2020,
  author={Sharma, Keerti Vardhan and Straka, Robert and Tavares, Frederico Wanderley},
  title={Current status of lattice {Boltzmann} methods applied to aerodynamic, aeroacoustic, and thermal flows},
  journal={Progress in Aerospace Sciences},
  year={2020},
  volume={115},
  pages={37},
  issn={0376-0421},
  abstract={We present a review of the evolutionary advancement of the lattice Boltzmann methods (LBM) and current status of their applications in dealing with diverse flow problems such as advection-diffusion, forced convection, natural convection, conjugate heat transfer, radiative heat transfer, thermal flow through porous media, phase transition and phase separation, aerodynamics and aeroacoustics, and thermal jet flows. We analyze and discuss various thermal LBM frameworks and methodologies since the time of Lattice Gas Automata (LGA) to recently developed methods to model simple and complex aerodynamic, aeroacoustic and thermal flow problems. Multi-speed (M-S) models, hybrid LBM models, and double distribution function LBM (DDF-LBM) models, along with their applications and improvements are discussed in details. Moreover, the recent developments in addressing various applications such as energy storage devices, systems with electro-thermo convection, aerodynamic flows, simulations of noise profiles in nozzles with and without chevrons, and thermal simulations of jet flow with LBM are presented in great detail.},
}