The section is organized as follows.
\inline{describe the section organization; highlight what's new: MPI implementation using TNL (multi-GPU, multi-CPU), Hypre solvers using TNL wrappers}
The original paper \cite{fucik:2019NumDwarf} includes a performance study on structured grids and unstructured meshes using only CUDA and OpenMP parallelization (i.e., without MPI).
The paper \cite{klinkovsky:2022meshes} contains an updated performance study on unstructured meshes using MPI and/or CUDA for parallelization across multiple CPU cores, multiple GPUs, or even multiple nodes.
\subsection{General formulation}
The numerical scheme is implemented for a PDE system written in the form
For the purpose of this thesis, the computational domain, boundary conditions, and physical parameters were chosen identically to \cite{fucik:2019NumDwarf}.
The simulations were computed on the meshes listed in \cref{tab:meshes}, which are the same as the triangular and tetrahedral meshes used in \cite{fucik:2019NumDwarf}.
All simulations were computed in \ic{double} precision.
\inline{add EOC tables}
\subsection{Benchmarking methodology}
\inline{same as \cref{sec:meshes:benchmarking methodology}, but includes MPI and $E\!f\!f$}
The generalized McWhorter--Sunada problem was computed as a benchmark for several configurations of the solver.
All configurations were tested on the triangular as well as tetrahedral meshes with the same problem parameters as in the verification study described in the previous subsection.
The computational parameters that varied in the benchmark are:
\begin{itemize}
\item
\emph{Linear system solver}.
First, we used the BiCGstab method implemented in TNL with the Jacobi preconditioner.
Additionally, we used the BoomerAMG preconditioner \cite{Hypre:BoomerAMG} from the Hypre library \cite{Hypre:design1,Hypre:design2,Hypre:library} version 2.25.0.
The latter preconditioner is coupled with the BiCGstab implementation from the Hypre library.
Several other components of the solver also had to be adapted to ensure compatibility with these two different BiCGstab implementations.
While the TNL implementation is not bound to any specific matrix format and the Ellpack format was used in the solver, the Hypre library requires the CSR format with specific conventions.\todo{Describe the conventions in \cref{chapter:linear systems} and leave a reference here.}
Hence, the sparse matrix assembly uses slightly different procedures in the two configurations.
\item
\emph{Hardware architecture}.
The benchmark was computed either solely on CPUs, or using GPU accelerators.
In the latter case, all parts of the MHFEM algorithm were executed directly on GPUs, so data transfers between the GPU and main system memory do not limit the performance.
Both TNL and Hypre rely on the CUDA framework \cite{nvidia:cuda} for GPU parallelization and they use the so-called CUDA-aware MPI \cite{kraus:cuda-aware-mpi} to avoid buffering of the data passed between multiple GPUs in the system memory.
\item
\emph{Programming framework for CPU parallelization}.
For the TNL implementation, we tested two approaches to CPU parallelization within a single node.
We used the shared-memory multi-thread approach via the OpenMP framework \cite{OpenMP} and also the general multi-process approach based on MPI \cite{mpi:3.1}.
The Hypre implementation was used only with the general MPI-based parallelization.
\end{itemize}
To ensure comparable results obtained with different preconditioners for the BiCGstab method, care must be taken to select suitable stopping criteria for the iterative algorithm.
The linear system solvers implemented in TNL use left-preconditioning, whereas the Hypre library implements solvers with right-preconditioning (see \cref{chapter:linear systems}\todo{make sure left- vs right-preconditioning is described there} for details).
Recall that right-preconditioned methods operate with the \emph{original} residual and right-hand-side vectors, whereas left-preconditioned methods operate with \emph{preconditioned} residual and right-hand-side vectors.
Hence, with the Jacobi preconditioner, left-preconditioning scales the right-hand-side vector by the matrix diagonal entries, so the stopping criterion based on the norm of the residual vector divided by the norm of the right-hand-side vector cannot be used with the same threshold as for the unpreconditioned or right-preconditioned system.
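Explicitly, writing $\matrix M$ for the preconditioner of the system $\matrix A x = b$, the two approaches solve
\begin{equation*}
\matrix M^{-1} \matrix A x = \matrix M^{-1} b
\quad \text{(left)}
\qquad \text{or} \qquad
\matrix A \matrix M^{-1} y = b , \ x = \matrix M^{-1} y
\quad \text{(right)} ,
\end{equation*}
so a left-preconditioned solver monitors $\lVert \matrix M^{-1} ( b - \matrix A x ) \rVert / \lVert \matrix M^{-1} b \rVert$, whereas a right-preconditioned solver monitors $\lVert b - \matrix A x \rVert / \lVert b \rVert$.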
To remedy this problem, we assemble the linear system with scaled rows to obtain ones on the main diagonal of the matrix (i.e., the Jacobi preconditioning is applied during the matrix assembly).
This scales the elements of the right-hand-side vector to the same magnitude in both preconditioning techniques and allows the stopping criterion to be used with the same threshold (\np{e-11} in this study) in both cases.
Besides the stopping criterion, the BiCGstab method requires no other configuration.
The following common parameters were set for the BoomerAMG preconditioner:
\begin{itemize}
\item The PDE system size (i.e., $n$ in \cref{eq:NumDwarf}) was set to 2: \\\ic{HYPRE_BoomerAMGSetNumFunctions( precond, 2 );}
\item Aggressive coarsening with 1 level: \ic{HYPRE_BoomerAMGSetAggNumLevels( precond, 1 );}
\item For both 2D and 3D problems, the strength threshold was set to 0.25: \\\ic{HYPRE_BoomerAMGSetStrongThreshold( precond, 0.25 );}
\item The \ic{extended+i} interpolation operator with truncation to at most 4 elements per row: \\\ic{HYPRE_BoomerAMGSetInterpType( precond, 6 );} and \ic{HYPRE_BoomerAMGSetPMaxElmts( precond, 4 );}
\end{itemize}
Additionally, the following parameters were set for CPU computations:
\begin{itemize}
\item The HMIS coarsening algorithm: \ic{HYPRE_BoomerAMGSetCoarsenType( precond, 10 );}
\item The truncation factor for the interpolation was set to 0.3: \\\ic{HYPRE_BoomerAMGSetTruncFactor( precond, 0.3 );}
\end{itemize}
The following parameters were set specifically for GPU computations:
\begin{itemize}
\item The PMIS coarsening algorithm: \ic{HYPRE_BoomerAMGSetCoarsenType( precond, 8 );}
\item The $\ell_1$-scaled Jacobi smoother: \ic{HYPRE_BoomerAMGSetRelaxType( precond, 18 );}
\item The 2-stage \ic{extended+e} interpolation in matrix-matrix form was used on the levels of aggressive coarsening: \ic{HYPRE_BoomerAMGSetAggInterpType( precond, 7 );}
\item The local interpolation transposes were saved to avoid sparse matrix--transpose--vector multiplications: \ic{HYPRE_BoomerAMGSetKeepTranspose( precond, 1 );}
\item The triple matrix product $\matrix R \matrix A \matrix P$ was replaced by two matrix products: \\\ic{HYPRE_BoomerAMGSetRAP2( precond, 1 );}
\end{itemize}
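Collected together, the GPU configuration above corresponds to the following call sequence (a sketch assuming a \ic{HYPRE_Solver precond} already created via \ic{HYPRE_BoomerAMGCreate}; error checking omitted):

```c
/* Sketch of the BoomerAMG configuration used for GPU computations. */
HYPRE_BoomerAMGSetNumFunctions( precond, 2 );     /* PDE system size n = 2 */
HYPRE_BoomerAMGSetAggNumLevels( precond, 1 );     /* aggressive coarsening, 1 level */
HYPRE_BoomerAMGSetStrongThreshold( precond, 0.25 );
HYPRE_BoomerAMGSetInterpType( precond, 6 );       /* extended+i interpolation */
HYPRE_BoomerAMGSetPMaxElmts( precond, 4 );        /* truncate to 4 elements per row */
HYPRE_BoomerAMGSetCoarsenType( precond, 8 );      /* PMIS coarsening */
HYPRE_BoomerAMGSetRelaxType( precond, 18 );       /* l1-scaled Jacobi smoother */
HYPRE_BoomerAMGSetAggInterpType( precond, 7 );    /* 2-stage extended+e interpolation */
HYPRE_BoomerAMGSetKeepTranspose( precond, 1 );    /* keep local interpolation transposes */
HYPRE_BoomerAMGSetRAP2( precond, 1 );             /* two matrix products instead of RAP */
```

For CPU computations, the last five calls would be replaced by the HMIS coarsening and truncation-factor settings listed above.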
There are many other configurable parameters listed in the Hypre manual \cite{Hypre:manual} that might have an effect on the performance, but they were not investigated in this benchmark.
To reduce the computational time with the Hypre implementation, the BoomerAMG preconditioner is not updated in every time step and instead its setup is reused multiple times as long as the number of iterations necessary for the BiCGstab method to converge does not increase significantly.
Reusing preconditioners and recycling Krylov subspaces are common techniques for solving sequences of linear systems \cite{carr2021recycling,amritkar2015recycling,gaul2014recycling}, such as those arising from discretizations of evolutionary partial differential equations.
The computations were performed on the same hardware as described in \cref{sec:meshes:benchmarking methodology}, and also the software versions and compiler parameters were the same.
The MPI implementation is Open MPI 4.0.5 built with CUDA support and system-specific drivers.
Each computation was performed exactly once and the times spent in individual parts of the algorithm were measured.
The computational time $CT$ was measured from the point when the solver was fully initialized until the final simulation time was reached.
The secondary quantities of interest for the evaluation of parallel computations are speed-up $Sp$, which is defined as the ratio between computational times using $1$ and $\ell$ compute units (e.g., CPU cores), and parallel efficiency $E\!f\!f$, which quantifies the scalability of the computation and is defined as the ratio between the speed-up and the number of compute units, i.e.
\begin{equation*}
E\!f\!f = \frac{\text{$CT$ for 1 unit}}{\ell\times (\text{$CT$ for $\ell$ units})} .
\end{equation*}
In the case of GPU computations, the speed-up and efficiency are calculated relative to the number of whole GPUs rather than the number of individual GPU cores, because the GPU cores are not independent.
\subsection{Benchmark results}
The benchmark results are presented as strong-scaling studies for the finest triangular mesh 2D$^\triangle_5$ and finest tetrahedral mesh 3D$^\triangle_5$.
\Cref{tab:mhfem:comptimes:CPU 2D} shows the CPU computational times $CT$ and parallel efficiency $E\!f\!f$ for the 2D problem.
The computational times of the TNL solver using OpenMP and MPI are comparable, but the latter is faster in most cases.
The OpenMP efficiency even drops below 0.5 when all 24 cores of the two processors on a cluster node are used, which is most likely caused by inefficient memory allocations in the NUMA system and excessive communication over the interconnection between the processors.
The Hypre computations are faster than TNL by a factor of 2.5 to 3, which demonstrates the importance of a strong preconditioner.
The linear systems arising from the 2D problem took the Jacobi-preconditioned BiCGstab method more than 500 iterations on average to converge, whereas the BoomerAMG-preconditioned BiCGstab converged after just 23 iterations on average.
The results of CPU computations for the 3D problem are shown in \cref{tab:mhfem:comptimes:CPU 3D}.
For a single node (up to 24 cores), the results are analogous to the 2D problem, except that the differences between the methods are much larger.
For TNL computations, the OpenMP and MPI results are comparable only for 1 core (i.e., sequential computation) and MPI is significantly faster than OpenMP in all other cases.
This is surprising, because OpenMP operates on the shared-memory level whereas the MPI approach is based on messages explicitly passed between the processes, but it demonstrates that a simple and user-friendly programming interface may incur a significant performance penalty that is hard to analyze and eliminate.
The Hypre computations exhibit parallel efficiency similar to the MPI-based TNL computations, but the computational times are almost 5 times shorter.
The linear systems arising from the 3D problem took the Jacobi-preconditioned BiCGstab about 530 iterations on average to converge, whereas the BoomerAMG-preconditioned BiCGstab converged after just 17 iterations on average.
Additionally, \cref{tab:mhfem:comptimes:CPU 3D} includes results of CPU computations distributed across multiple nodes using MPI.
The computational cluster used for the computations consists of 20 dual-processor nodes containing CPUs listed in \cref{tab:hardware}, but only 14 nodes at most could be employed together in one MPI computation.
It can be noticed that while the parallel efficiency with respect to 1 core dropped to about 0.65 on 24 cores, which is caused by the saturation of the memory bandwidth and automatic scaling of the CPU core frequency, it stays at this level or even increases slightly when multiple nodes are utilized for the computation.
This increase in efficiency may be attributed to the increasing total cache size and the problem size staying constant in the strong-scaling study.
Using Hypre for the solution of the linear systems led to lower parallel efficiency compared to the simple Jacobi preconditioner used in the TNL computations, which is an inevitable consequence of the more complex communication patterns during preconditioning.
Even for the computations using the most nodes, Hypre is more than 4 times faster than TNL.
The Hypre library provides state-of-the-art linear system solvers and preconditioners targeting contemporary supercomputers consisting of thousands of nodes, but much larger problems would be necessary to utilize these computational resources efficiently.
\begin{table}[!tb]
\caption{
\input{./data/mcwhdd/comptimes_cpu_3D.tex}
\end{table}
The results of GPU computations for both 2D and 3D problems are shown in \cref{tab:mhfem:comptimes:GPU}.
Note that each MPI rank manages its dedicated GPU, so each computation used as many GPUs as there were MPI ranks.
It can be noticed that in the 2D case, using multiple GPUs does not lead to faster computations (the speed-ups $Sp$ for the mesh 2D$^\triangle_5$ are less than~1).
This can be attributed to the latency of communication between multiple GPUs, which is high compared to the actual computation, because the triangular meshes used in the benchmark do not provide enough degrees of freedom to fully utilize the computational power of the GPUs.
On the other hand, the tetrahedral meshes used in the benchmark provide more degrees of freedom, so the speed-ups $Sp$ for the mesh 3D$^\triangle_5$ rise above~1.
Comparing the GPU computational times from \cref{tab:mhfem:comptimes:GPU} with the corresponding CPU computations from \cref{tab:mhfem:comptimes:CPU 2D,tab:mhfem:comptimes:CPU 3D}, it can be seen that GPU acceleration brings a significant advantage for the simple Jacobi preconditioner in TNL, but a much smaller one for the BoomerAMG preconditioner from the Hypre library.
For the TNL solver and the 3D problem, using 1~GPU leads to a shorter computational time than using 8~CPUs (4~nodes), and at least 32~CPUs (16~nodes) are necessary to achieve a shorter computational time than with 4~GPUs.
The GPU computations with Hypre are only slightly faster than the TNL computations: only 2~CPUs (1~node) are necessary for Hypre to be competitive with 1~GPU, and using Hypre on 8~CPUs (4~nodes) is faster than the fastest computation using 4~GPUs.
\begin{table}[tb]
\caption{
\input{./data/mcwhdd/comptimes_gpu.tex}
\end{table}
The final result shown in this section is the breakdown of the overall computational times from \cref{tab:mhfem:comptimes:CPU 3D} on the finest tetrahedral mesh 3D$^\triangle_5$.
In \cref{tab:mhfem:portions:TNL,tab:mhfem:portions:Hypre} it can be observed that regardless of the preconditioner used, the major portion (over 90\thinspace\%) corresponds to the linear system solver (BiCGstab) rather than operations involving the unstructured mesh.
Recall that the BoomerAMG preconditioner is not updated in every time step, so its contribution to the total computational time stays below 1\thinspace\%.
The other entries in the tables correspond to various operations for the MHFEM discretization (\emph{MHFEM routines}), data synchronizations in the ghost regions with \ic{DistributedMeshSynchronizer} (\emph{MPI communication of mesh data}), and the sparse matrix assembly.
The MHFEM routines involve the same operations in both TNL and Hypre implementations and thus take the same time, but they are relatively more significant in the Hypre implementation where the linear system solver is faster.
The sparse matrix assembly for Hypre takes slightly longer than for TNL, but it is still faster than the MHFEM routines.
Note that the linear system solver also includes a considerable amount of MPI communication which is segregated in its own contribution in \cref{tab:mhfem:portions:TNL}, but included in the linear system solver entry in \cref{tab:mhfem:portions:Hypre}.
Additional compiler flags for \ic{nvcc} were
% FIXME bug in the \ic macro, "--" is output as "-"
{\smaller\texttt{-}}\ic{-expt-relaxed-constexpr} and {\smaller\texttt{-}}\ic{-expt-extended-lambda}, and the GPU architecture was set to \ic{sm_70}.
We compared computational times $CT$ for sequential (single-core) computations on the CPU, parallel multi-core computations on the CPU using OpenMP~\cite{OpenMP}, and parallel computations on one GPU accelerator using CUDA~\cite{nvidia:cuda} parallelization.
Note that the distributed mesh is not tested in this section.
To obtain statistically significant results, the computational time was taken as the average of 100 iterations of each computation.
The secondary quantity of interest for the comparison of sequential computations and parallel computations using $\ell$ CPU cores is the speed-up $Sp_\ell$, which is defined as the ratio between computational times using $1$ and $\ell$ cores.