The lattice Boltzmann method \cite{kruger:2017lattice} (LBM) is an alternative to traditional computational methods such as finite difference, finite volume, and finite element methods.
LBM can be formulated as a time-explicit scheme that can be easily parallelized \cite{kruger:2017lattice}, and the advent of general-purpose computing on graphics processing units (GPGPU) has made large-scale numerical simulations of turbulent flows feasible \cite{kang2013,geier2015cumulant,kumar2018,peng2018,wittmann2013,zakirov2021}.
For the simulation of air flow in a \todo{wind tunnel simulations are in the next chapter}wind tunnel, we could take advantage of the previously developed parallel LBM algorithms validated in \cite{fucik2019,eichler2021a,eichler2021b,eichler2022}.
Instead of directly solving a macroscopic PDE, such as the Navier--Stokes equations or a general advection-diffusion equation, LBM approximates the temporal evolution of macroscopic quantities such as density $\rho$~[\si{\kilogram\per\cubic\metre}], velocity $\vec v$~[\si{\metre\per\second}], and other variables (e.g., pressure, stress tensor, etc.) using probability moments of the density distribution function $f(\vec x, \vec\xi, t)$~[\si{\kilogram\second\cubed\per\metre\tothe{6}}] that represents the density of particles with velocity $\vec\xi=[\xi_x, \xi_y, \xi_z]^T$~[\si{\metre\per\second}] at position $\vec x$ and time $t$.
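For instance, the lowest-order moments recover the macroscopic density and momentum (standard kinetic-theory relations, see \cite{kruger:2017lattice}):
\begin{align*}
\rho(\vec x, t) &= \int_{\mathbb{R}^3} f(\vec x, \vec\xi, t) \,\mathrm{d}\vec\xi, &
\rho(\vec x, t)\,\vec v(\vec x, t) &= \int_{\mathbb{R}^3} \vec\xi\, f(\vec x, \vec\xi, t) \,\mathrm{d}\vec\xi.
\end{align*}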
Based on the kinetic theory \cite{kruger:2017lattice}, the density distribution function $f$ evolves according to the Boltzmann transport equation, which in the force-free case reads
\begin{equation*}
\frac{\partial f}{\partial t} + \vec\xi \cdot \nabla_{\vec x} f = \mathcal{C}(f),
\end{equation*}
where $\mathcal{C}(f)$ denotes the collision operator.
Different initialization schemes can be found in \cite{kruger:2017lattice}.
\subsection{Streaming schemes}
\label{sec:lbm:streaming schemes}
\subsubsection{Push and pull schemes with A-B pattern}
\subsection{Results}
Since the verification and validation of the LBM implementation was performed primarily by other research team members \cite{fucik2019,eichler2021a,eichler2021b,eichler2022}, this section presents only the performance evaluation, which is the author's original work.
The benchmark problem used in this section simulates a laminar flow in a rectangular duct for which an analytical solution can be derived \cite{white2005}.
The domain $\Omega=(0, L_x)\times(0, L_y)\times(0, L_z)$ has an inflow boundary on the left side, an outflow boundary on the right side, and solid walls on the remaining sides.
The inflow velocity profile is prescribed as $\vec v(0, y, z)=[v_{\mathrm{an},x}(y, z), 0, 0]^T$, where $v_{\mathrm{an},x}$ denotes the $x$-component of the velocity in the analytical solution.
On the opposite side, a simple equilibrium outflow boundary condition is applied, and the no-slip condition is enforced on the solid walls via the bounce-back technique.
The kinematic viscosity is set as $\nu=\SI{1.5e-5}{\metre\squared\per\second}$ and the final simulation time is $t_{\max}=\SI{100}{\second}$.
The typical metric for performance evaluation of LBM is GLUPS (i.e., giga-LUPS, \textit{billions of lattice updates per second}).
The value of GLUPS is calculated by taking the number of lattice sites $N_x \times N_y \times N_z$, multiplying it by a fixed number of iterations, and dividing it by $10^9$ times the computational time needed for the simulation to complete the given number of steps.
Running values of GLUPS can be calculated during the simulation by measuring the time needed to complete a predefined part of the simulation (e.g., a given number of iterations).
The values reported in this section were calculated by taking the total number of iterations in the whole simulation and the corresponding computational time (i.e., without the time spent in initialization, data output, etc.).
The GLUPS values are comparable among different lattice resolutions and hardware platforms used for the computations, but they depend on the underlying configuration of LBM.
Most notably, the results reported in this section are specific to the D3Q27 velocity set and the cumulant collision operator.
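As a minimal illustration of the definition above (function and variable names are ours, not the solver's):

```cpp
#include <cstdint>

// GLUPS = (lattice sites * iterations) / (10^9 * wall-clock seconds).
// A small helper sketch; names are illustrative, not from the solver.
double glups(std::int64_t nx, std::int64_t ny, std::int64_t nz,
             std::int64_t iterations, double seconds)
{
    const double updates = static_cast<double>(nx) * static_cast<double>(ny)
                         * static_cast<double>(nz) * static_cast<double>(iterations);
    return updates / (1e9 * seconds);
}
```

For example, a $512^3$ lattice advanced by 1000 iterations in \SI{100}{\second} corresponds to roughly 1.34 GLUPS.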
\subsubsection{Comparison of streaming patterns}
As the first result, we compare the performance of two streaming patterns in our implementation: the A-B pull scheme and the A-A pattern (see \cref{sec:lbm:streaming schemes}).
The benchmark was computed on a fixed domain with $L_x = L_y = L_z =\SI{0.25}{\metre}$, using single or double precision, and without MPI distribution (i.e., using one MPI rank).
The same simulation was computed on several Nvidia GPU accelerators from the three latest architectures (Pascal, Volta/Turing, Ampere) and two product lines (GeForce, Tesla).
The solver was compiled with the \ic{nvcc} compiler version 11.8 using the flags \ic{-std=c++17 -O3 --use_fast_math} and the host compiler \ic{g++} version 11.3.0.
The results included in \cref{tab:lbm:AB vs AA} show that in all cases, the difference between the A-B pull scheme and the A-A pattern in terms of performance is negligible.
For the GeForce line, the values of GLUPS in single precision are approximately proportional to the maximum throughput of the GPU's global memory, while in double precision they are approximately proportional to the GPU's peak compute performance (FLOP/s).
This suggests that LBM in single precision is a memory-bound application on these platforms, while LBM in double precision is compute-bound (note that GPUs in the GeForce line have severely limited performance in double precision \cite{nvidia:cuda}).
For the Tesla line, where the ratio between the performance in single and double precision is 2 for each GPU, the GLUPS values for single as well as double precision are approximately proportional to the maximum memory bandwidth.
For single precision, the proportionality constant even matches that of the GeForce line.
LBM is therefore a memory-bound application on Tesla GPU accelerators in single as well as double precision.
%\inline{It seems that LBM is becoming more compute bound on A100 (it has $3\times$ higher attainable memory bandwidth compared to P100, but only $1.85\times$ higher compute performance, and GLUPS increased by a factor of 2.55 (SP) and 2.74 (DP)). But H100 is reported to have >3x higher compute performance compared to A100 and only 2x higher memory bandwidth, so this is not a long-term trend...}
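The memory-bound reasoning can be sanity-checked with a simple roofline-style estimate. The sketch below assumes that each lattice update streams each of the 27 populations from and back to global memory exactly once (a common first-order model; the actual traffic depends on the implementation):

```cpp
// Attainable GLUPS if the kernel is purely memory-bound:
// memory bandwidth [GB/s] divided by bytes moved per lattice update.
// Assumes one read and one write of all 27 populations per update.
double attainable_glups(double bandwidth_gb_per_s, int bytes_per_value)
{
    const double bytes_per_update = 2.0 * 27.0 * bytes_per_value;  // read + write
    return bandwidth_gb_per_s / bytes_per_update;
}
```

Comparing this estimate with measured GLUPS indicates how close the kernel is to the memory-bandwidth limit.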
Overall, based on the results in \cref{tab:lbm:AB vs AA}, it is evident that the A-A pattern provides an efficient way to reduce the memory requirements of LBM with negligible impact on the computational performance.
Hence, the A-A pattern will be used in all LBM simulations presented hereafter.
\begin{table}[tb]
\caption{
Performance comparison of the A-B pull scheme with the A-A streaming pattern in single and double precision on various Nvidia GPU architectures.
}
\label{tab:lbm:AB vs AA}
\centering
\input{./data/lbm/AB_vs_AA/table.tex}
\end{table}
\subsubsection{Scaling on the Karolina supercomputer}
The following results show the performance scaling of distributed LBM computations on the Karolina supercomputer~\cite{it4i:karolina}, operated by IT4Innovations in Ostrava.
The system comprises 72 accelerated compute nodes with hardware specifications listed in \cref{tab:hardware:Karolina}.
The solver was compiled with the \ic{nvcc} compiler version 11.6 using the flags \ic{-std=c++17 -O3 --use_fast_math} and the host compiler \ic{nvc++} version 22.2.
The MPI implementation was OpenMPI version 4.1.2.
All software components were loaded using the environment module \ic{OpenMPI/4.1.2-NVHPC-22.2-CUDA-11.6.0}.
\begin{table}[bt]
\caption{Hardware specifications of the accelerated compute nodes in the Karolina supercomputer \cite{it4i:karolina}, operated by IT4Innovations (\url{https://www.it4i.cz/}).}
\label{tab:hardware:Karolina}
\centering
\input{./data/hardware_table_Karolina.tex}
\end{table}
We performed a strong scaling analysis and two types of weak scaling analysis (explained below) using a 1D decomposition of the domain with $N_{\mathrm{ranks}}= N_{\mathrm{parts}}$ (i.e., each MPI rank managed one subdomain).
The number of GPUs used in the studies varied from 1 to 128 (i.e., 1 to 16 nodes) and each rank always used its own separate GPU.
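The 1D decomposition with $N_{\mathrm{ranks}} = N_{\mathrm{parts}}$ can be sketched as follows (one common slab-partitioning scheme; the solver's actual partitioning may differ in details):

```cpp
#include <utility>

// Returns the (offset, size) of the x-slab owned by `rank` when nx layers
// are split as evenly as possible among n_ranks subdomains.
std::pair<long, long> slab(long nx, int n_ranks, int rank)
{
    const long base = nx / n_ranks;
    const long rem  = nx % n_ranks;
    const long size = base + (rank < rem ? 1 : 0);  // first `rem` ranks get one extra layer
    const long offset = rank * base + (rank < rem ? rank : rem);
    return {offset, size};
}
```

For example, splitting $N_x = 512$ among 128 ranks gives each rank a slab of 4 layers.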
The scaling is evaluated based on the GLUPS metric as defined above.
For parallel computations using multiple GPUs, we define the speed-up $Sp$ as the ratio between the GLUPS achieved with $N_{\mathrm{ranks}}$ GPUs and with a single GPU, and the parallel efficiency $E\!f\!f$, which quantifies the scalability of the computation, as the speed-up divided by the number of GPUs, i.e.
\begin{equation*}
E\!f\!f = \frac{\text{GLUPS for $N_{\mathrm{ranks}}$ GPUs}}{N_{\mathrm{ranks}}\times (\text{GLUPS for 1 GPU})} .
\end{equation*}
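In code, the two metrics amount to (a trivial sketch with illustrative names):

```cpp
// Speed-up and parallel efficiency from measured GLUPS values.
double speedup(double glups_n, double glups_1)
{
    return glups_n / glups_1;
}

double efficiency(double glups_n, double glups_1, int n_ranks)
{
    return speedup(glups_n, glups_1) / n_ranks;
}
```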
In the strong scaling analysis, the global lattice size is kept constant and with increasing $N_{\mathrm{ranks}}$, the subdomains become smaller.
\Cref{tab:lbm:strong scaling} shows the results on a $512\times512\times512$ lattice where the best strong scaling efficiency was achieved.
The corresponding problem size is approximately 16~GB in single precision and 32~GB in double precision, so it cannot be increased much further if the problem is still to fit into the 40~GB memory of a single GPU.
The physical dimensions of the domain are $L_x = L_y = L_z =\SI{0.25}{\metre}$.
According to the results from \cref{tab:lbm:strong scaling}, the computations are strongly scalable with perfect efficiency up to two nodes and then the efficiency starts to drop.
When only GPUs from a single node are used, all communication goes through NVLink, which is the fastest interconnection between GPUs that is available on the Karolina supercomputer.
When two nodes are used, two ranks (with IDs 7 and 8) must exchange data over the InfiniBand interconnection between nodes.
When more than two nodes are used, some nodes must exchange the same amount of data with multiple nodes.
For example, with four nodes, ranks 8 and 15 from node 1 exchange data with rank 7 from node 0 and rank 16 from node 2, respectively.
Also note that due to the 1D decomposition used in this strong scaling study, the computation-to-communication ratio decreases with increasing number of ranks, since the amount of communication per rank is constant, but the subdomains become smaller.
Hence, the communication cost becomes dominant as it eventually cannot be overlapped completely with the computation using \cref{alg:distributed LBM with overlapping} on \cpageref{alg:distributed LBM with overlapping}.
In the last computation using 128 ranks, the size of each subdomain is $4\times512\times512$ and the amount of communication corresponds to $1/2$ of the subdomain volume for all ranks except the first and the last, where it corresponds to $1/4$.
The resulting parallel efficiency reported in \cref{tab:lbm:strong scaling} is 0.45 for single precision and 0.64 for double precision.
We suppose that this degenerate case could be significantly improved with a multidimensional decomposition of the domain.
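The communication-to-computation argument can be made concrete with a small sketch (assuming, as in the 1D decomposition above, that interior ranks exchange one $N_y\times N_z$ layer with each of their two neighbors and boundary ranks exchange with only one):

```cpp
// Fraction of a rank's subdomain volume that must be communicated per
// iteration in a 1D slab decomposition along x.
double halo_fraction(long local_nx, int rank, int n_ranks)
{
    const int layers = (rank == 0 || rank == n_ranks - 1) ? 1 : 2;
    return static_cast<double>(layers) / static_cast<double>(local_nx);
}
```

For a $4\times512\times512$ subdomain this yields $1/2$ for interior ranks and $1/4$ for the boundary ranks, matching the numbers above.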
\begin{table}[tb]
\caption{
Strong scaling in single and double precision on the Karolina supercomputer.
}
\label{tab:lbm:strong scaling}
\centering
\end{table}
In the first weak scaling analysis, the lattice size of each subdomain is kept constant and with increasing $N_{\mathrm{ranks}}$, the whole domain becomes larger.
Furthermore, the computation-to-communication ratio is kept constant as well.
\Cref{tab:lbm:weak scaling 1D} demonstrates almost perfect weak scaling for the case where the size of each subdomain is $256\times256\times256$ and the subdomains are arranged along the $x$-axis, resulting in a global lattice of size $256\, N_{\mathrm{ranks}}\times256\times256$.
We have also achieved the same parallel efficiency even with larger subdomains (such as $512\times512\times512$), but not with smaller subdomains (such as $128\times128\times128$) that are presumably too small to saturate dozens of GPU accelerators.
However, note that such weak scaling analysis is not practical, since the physical size of the domain is different in each case (here $L_x =0.25 N_{\mathrm{ranks}}\,\si{\metre}$ and $L_y = L_z =\SI{0.25}{\metre}$).
A more realistic weak scaling study is described below.
\begin{table}[tb]
\caption{
Weak scaling with 1D domain expansion in single and double precision on the Karolina supercomputer.
}
\label{tab:lbm:weak scaling 1D}
\centering
\end{table}
In the second weak scaling analysis, the primary requirement is to keep the physical size of the domain constant.
In this study, we fix $L_x = L_y = L_z =\SI{0.25}{\metre}$, same as in the strong scaling study.
The global discrete problem size is scaled in all three spatial dimensions such that the lattice size $N_x \times N_y \times N_z$ is (approximately) proportional to the number of ranks.
Since \cref{alg:LBM:CUDA thread block size} requires $N_y$ to be a multiple of 32, we scale the lattice as
\begin{equation*}
N_x = N_y = N_z = 32 \round{s \sqrt[3]{N_{\mathrm{ranks}}}},
\end{equation*}
where $\round{\cdot}$ denotes rounding to the nearest integer and $s$ is a scaling parameter.
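The scaling rule can be sketched as follows (assuming a cubic lattice with edge $32\round{s\sqrt[3]{N_{\mathrm{ranks}}}}$, which reproduces the base lattices used in this study):

```cpp
#include <cmath>

// Cubic lattice edge: a multiple of 32 chosen so that the total number of
// lattice sites is approximately proportional to the number of ranks.
long lattice_edge(int s, int n_ranks)
{
    return 32 * std::lround(s * std::cbrt(static_cast<double>(n_ranks)));
}
```

With $s = 16$ this gives the $512^3$ base lattice for one rank, and exact doublings of the edge for 8 and 64 ranks.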
For the results presented in \cref{tab:lbm:weak scaling 3D}, we chose $s =16$ for single precision in order to obtain the largest possible base lattice ($512\times512\times512$ for one rank) where the best weak scaling efficiency was achieved.
For double precision, we had to use a smaller base lattice with $s =8$ due to memory limitations.
The reason is that in this study the communication size per rank increases (it is proportional to $N_y \times N_z$) and storage for the data received from neighboring ranks must be allocated, so with $s =16$ and $N_{\mathrm{ranks}}\ge32$, the amount of memory needed per rank would exceed the capacity of the GPU accelerators.
\Cref{tab:lbm:weak scaling 3D} shows that $E\!f\!f$ does not behave monotonically: for 1, 8, and 64 ranks it is close to 1, but otherwise it is significantly lower.
However, this is not caused by the communication cost: the communication is completely overlapped with the computation, but the computation itself is slower than it should be.
The reason behind this is that the CUDA thread block size $(1, B_y, B_z)$ selected by \cref{alg:LBM:CUDA thread block size} for the given lattice size leads to a decrease in performance of the LBM algorithm compared to the optimal case with thread block size $(1, 256, 1)$, which requires $N_y$ to be a multiple of 256.
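One plausible selection rule consistent with these constraints (an assumption on our part, not necessarily the exact algorithm) is to take the largest $B_y \le 256$ that is a multiple of 32 and divides $N_y$:

```cpp
// Pick the block extent along y: the largest multiple of 32 that divides ny
// and does not exceed 256. For ny divisible by 256 this yields the optimal
// block size (1, 256, 1).
int select_block_y(int ny)
{
    for (int b = 256; b >= 32; b -= 32) {
        if (ny % b == 0) {
            return b;
        }
    }
    return 32;  // unreachable when ny is a multiple of 32
}
```

For instance, $N_y = 512$ yields the optimal $B_y = 256$, whereas $N_y = 1632$ (a multiple of 32 but not of 256) yields only $B_y = 96$, so the kernel runs with a suboptimal block shape.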
\begin{table}[tb]
\caption{
Weak scaling with 3D domain expansion in single and double precision on the Karolina supercomputer.
}
\label{tab:lbm:weak scaling 3D}
\centering
\end{table}
Note that all results presented in this section were computed without GPUDirect \todo{cite}, which is still not fully functional on the Karolina supercomputer.
Hence, the effective bandwidth of the MPI communication was limited, because data could not be transferred from the GPU memory directly to the InfiniBand network interface and had to be buffered in the operating system memory instead.
Another problem is that in each computation, InfiniBand was used in a multi-rail configuration \todo{cite} with two ports, but these ports were shared by all ranks on a node while the other two InfiniBand ports remained completely unused.
We are not certain if this is a hardware or software problem and how it should be addressed.