In the latter case, the algorithm needs to be extended with data copies to and from the device memory due to input and output of relevant macroscopic quantities, but these steps are omitted in \cref{alg:parallel LBM} for clarity.
In order to utilize multiple GPUs or even multiple nodes for computing in a distributed fashion, an additional level of coarse-grained parallelism has to be introduced based on the domain decomposition approach.
The global lattice $\mathcal L_{\overline\Omega}$ is split into subdomains $\mathcal L_{\overline\Omega, j}$, $j \in\{1, \ldots, N_{\mathrm{parts}}\}$ which are processed independently using $N_{\mathrm{ranks}}\le N_{\mathrm{parts}}$ independent processes (MPI ranks).
Each subdomain is assigned to one MPI rank that is mapped to one coarse-grained computational unit, such as a CPU core, a CPU socket, or a GPU accelerator.
Each MPI rank is responsible for processing computations related to one or more subdomains that were assigned to it.
In order to obtain consistent results, it must be ensured that the values of discrete density distribution functions are propagated correctly across the interfaces between subdomains.
\end{figure}
The relation between the \ic{LBM_DATA} and \ic{LBM_BLOCK} classes is highlighted in \cref{fig:lbm:class diagram} along with additional classes that will be described later.
The \ic{LBM_BLOCK} objects representing individual subdomains managed by the current MPI rank are aggregated in an \ic{std::vector} container in the class named \ic{LBM}.
The \ic{LBM} class provides member functions to manipulate all managed blocks collectively, as well as additional functionality related to LBM simulations, such as time-stepping variables and conversion between physical and lattice units.
From a high-level point of view, each MPI rank creates one object of the \ic{State} class, which combines all components needed to run a simulation.
\subsection{Computational benchmark results}
\label{sec:mhfem:computational results}
The benchmark results are presented as strong-scaling studies for the finest triangular mesh 2D$^\triangle_5$ and finest tetrahedral mesh 3D$^\triangle_5$.
\Cref{tab:mhfem:comptimes:CPU 2D} shows the CPU computational times $CT$ and parallel efficiency $E\!f\!f$ for the 2D problem.
The computational times of the TNL solver using OpenMP and MPI are comparable, but the latter is faster in most cases.
The OpenMP efficiency even drops below 0.5 when all 24 cores of the two processors on a cluster node are used, which is most likely caused by inefficient memory allocations in the NUMA system and excessive communication over the interconnection between the processors.
Additionally, \cref{tab:mhfem:comptimes:CPU 3D} includes results of CPU computations distributed across multiple nodes using MPI.
The computational cluster used for the computations consists of 20 dual-processor nodes containing CPUs listed in \cref{tab:hardware}, but at most 14 nodes could be employed together in one MPI computation.
It can be noticed that while the parallel efficiency with respect to 1 core drops to about 0.65 on 24 cores, which is caused by the saturation of the memory bandwidth and by automatic CPU core frequency scaling, it stays at this level or even increases slightly when multiple nodes are utilized for the computation.
This increase in efficiency may be attributed to the increasing total cache size and the problem size staying constant in the strong-scaling study.
Using Hypre for the solution of the linear systems led to lower parallel efficiency compared to the simple Jacobi preconditioner used in the TNL computations, which is an inevitable consequence of the more complex communication patterns during preconditioning.
Nevertheless, even for the computations using the most nodes, Hypre is more than 4 times faster than TNL.
The Hypre library provides state-of-the-art linear system solvers and preconditioners targeting contemporary supercomputers consisting of thousands of nodes, but much larger problems would be necessary to utilize these computational resources efficiently.
All three experimental configurations were simulated up to the final time $t_{\max}=\SI{100}{\second}$ in three different resolutions, hereafter denoted as RES-1, RES-2, and RES-3.
A reference lattice and mesh were generated for the initial resolution RES-1 and the space step is halved with each subsequent resolution.
%The simulations were computed on a system with two AMD EPYC 7543 processors and four Nvidia Tesla A100 GPU accelerators with NVLink interconnection.
The simulations were computed on a system with two AMD EPYC 7763 processors and eight Nvidia Tesla A100 GPU accelerators with NVLink interconnection.
See~\cref{tab:resolutions} for the characteristics of each resolution and computational resources needed for the simulations.
\begin{table}[ht]
\end{tabular}
\caption{
Characteristics of each resolution used for the presented simulations.
The computational times were achieved using 8 Nvidia Tesla A100 cards with NVLink interconnection.
}
\label{tab:resolutions}
\end{table}
\end{tabular}
\caption{
Strong scaling of the coupled LBM-MHFEM solver for the scenario EX-1 in the resolution RES-2.
$N_{\mathrm{GPUs}}$ denotes the number of Nvidia Tesla A100 GPUs used in the computation, the Time column reports the computational time excluding initialization, the performance metric GLUPS stands for \textit{billions of lattice updates per second}, and $E\!f\!f$ denotes the parallel efficiency.
}
\label{tab:strong_scaling}
\end{table}
Based on the validation study presented in this paper, we can draw reasonable predictions about the flow and transport behavior inside the computational domain.
The performance of the coupled solver depends on the selected lattice and mesh sizes (i.e., spatial resolution) and the adaptively selected time steps.
The highest-resolution simulations presented in this paper, which agree best with the experimental data, require about 200~GiB of memory and \SI{15.25}{\hour} of computational time on 8 Nvidia Tesla A100 cards to simulate \SI{100}{\second} of physical time.
The simulations in lower resolutions are not as accurate, but require less memory and shorter computational time compared to the highest resolution.
A strong scaling analysis was performed for a lower resolution giving a parallel efficiency of 80\% on 8 Nvidia Tesla A100 cards.
Scalability problems that are likely to occur on large-scale supercomputers (e.g., due to the one-dimensional decomposition of the domain) were not investigated here due to the limited availability of computational resources.
The generalization of the domain decomposition procedure from \cref{sec:decomposition} to improve the scalability on large supercomputers may be the subject of future research.