working on the MHFEM section (d4c755f5) · Commits · Jakub Klinkovský / Dissertation

content/MHFEM.tex

+81 −51

Original line number	Diff line number	Diff line
		\inline{Supporting paper: \cite{fucik:2019NumDwarf}, some performance results also in \cite{klinkovsky:2022meshes}}
		\inline{Od diplomky: MPI implementace pomocí TNL (multi-GPU, multi-CPU), Hypre solvers using TNL wrappers}
		The content of this section is based on the author's previous thesis \cite{klinkovsky:2017thesis} and the paper \cite{fucik:2019NumDwarf}, which describe \emph{NumDwarf}, a numerical scheme for the solution of a system of partial differential equations in a general coefficient form.
		\emph{NumDwarf} is based on the mixed-hybrid finite element method (MHFEM) and it was originally developed for simulating multicomponent transport phenomena in porous media (see \cite{solovsky2020,askar2021} for several recent applications), but it can be used for any problem whose governing equations can be written in a compatible form.
		Although the implementation of the numerical scheme has seen only minor modifications since the publication of the paper, the computational aspects of the author's solver have been greatly extended.
		Most notably, the solver supports MPI computations on distributed unstructured meshes.

		\emph{NumDwarf} was originally developed for simulating multicomponent transport phenomena in porous media, but the numerical scheme is implemented for a PDE system written in the form
		The section is organized as follows.
		\inline{describe the section organization; highlight what's new: MPI implementation using TNL (multi-GPU, multi-CPU), Hypre solvers using TNL wrappers}

		\subsection{General formulation}

		The numerical scheme is implemented for a PDE system written in the form
		\begin{equation} \label{eq:NumDwarf}
		\begin{split}
		\sum\limits_{j=1}^n N_{i,j} \pder{Z_j}{t} &+ \sum\limits_{j=1}^n \vec u_{i,j} \cdot \nabla Z_j
		@@ -10,14 +17,31 @@
		+ \sum\limits_{j=1}^n r_{i,j} Z_j = f_i
		\end{split}
		\end{equation}
		for $i = 1, \ldots, n$, where $\vec Z = [Z_1, \ldots, Z_n]^T$ is the vector of unknown functions depending on spatial and temporal coordinates and the symbols $\matrix N = [N_{i,j}]_{i,j=1}^n$, $\vec u = [\vec u_{i,j}]_{i,j=1}^n$, $\vec m = [m_i]_{i=1}^n$, $\matrix D = [\tensor D_{i,j}]_{i,j=1}^n$, $\vec w = [\vec w_i]_{i=1}^n$, $\vec a = [\vec a_{i,j}]_{i,j=1}^n$, $\vec r = [r_{i,j}]_{i,j=1}^n$, $\vec f = [f_i]_{i=1}^n$ are given coefficients.
		for $i = 1, \ldots, n$, where $\vec Z = [Z_1, \ldots, Z_n]^T$ is the vector of unknown functions depending on spatial coordinates $\vec x \in \Omega \subset \mathbb R^D$ and time $t \in [0, t_{\max}]$, where $D \in \{1, 2, 3\}$ denotes the spatial dimension, $\Omega$ is a polyhedral domain, and $t_{\max}$ is the final simulation time.
		The remaining symbols $\matrix N = [N_{i,j}]_{i,j=1}^n$, $\vec u = [\vec u_{i,j}]_{i,j=1}^n$, $\vec m = [m_i]_{i=1}^n$, $\matrix D = [\tensor D_{i,j}]_{i,j=1}^n$, $\vec w = [\vec w_i]_{i=1}^n$, $\vec a = [\vec a_{i,j}]_{i,j=1}^n$, $\vec r = [r_{i,j}]_{i,j=1}^n$, $\vec f = [f_i]_{i=1}^n$ are given problem-specific coefficients.
		Note that the coefficients may, in general, depend on the vector of unknown functions $\vec Z$.
		Coefficients dependent on $\vec Z$ in non-linear terms are evaluated by the numerical scheme using the values from the current time level, whereas the terms in \cref{eq:NumDwarf} containing $Z_j$ are treated implicitly.

		\emph{NumDwarf} can be used for any problem whose governing equations can be written in a compatible form, including multiphase and multicomponent flow in porous media \cite{solovsky2020,askar2021}.
		The idea behind the notation is that coefficients dependent on $\vec Z$ in non-linear terms are evaluated by the numerical scheme using the values from the current time level, whereas the symbols $Z_j$ in \cref{eq:NumDwarf} are treated implicitly.\todo{reference \cref{sec:mhfem:numerical scheme} or \cite{fucik:2019NumDwarf,klinkovsky:2017thesis} for details}

		The system of Equations~\eqref{eq:NumDwarf} must be supplemented by a suitable initial condition $\vec Z(\vec x, 0) = \vec Z_{\mathrm{ini}}(\vec x)$ and boundary conditions.
		\todo{Describe the boundary conditions in general.}
		The system of equations~\eqref{eq:NumDwarf} must be supplemented by a suitable initial condition
		\begin{equation}
		\vec Z(\vec x, 0) = \vec Z_{\mathrm{ini}}(\vec x), \qquad \forall x \in \Omega,
		\end{equation}
		and boundary conditions for all $i = 1, \ldots, n$, all $\vec x \in \partial\Omega$, and all $t \in (0, t_{\max})$.
		There are multiple options, such as the Dirichlet-type fixed-value condition
		\begin{subequations}
		\begin{alignat}{3}
		Z_i(\vec x, t) &= Z_i^{\mathcal D}, &\qquad& \forall x &\in \Gamma_i^{\mathcal D} \subset \partial\Omega,
		\intertext{or the Neumann-type fixed-flux condition}
		\vec v_i(\vec x, t) \cdot \vec n_{\partial\Omega}(\vec x) &= v_i^{\mathcal N}, &\qquad& \forall x &\in \Gamma_i^{\mathcal N} \subset \partial\Omega,
		\end{alignat}
		\end{subequations}
		where $\Gamma_i^{\mathcal D}$ and $\Gamma_i^{\mathcal N}$ denote parts of the domain boundary where the respective conditions are active, and $\vec n_{\partial\Omega}$ is the outward unit normal vector.
		Note that in case of non-linear problems where the coefficient $\vec m$ depends on the unknowns $\vec Z$, it may be necessary to specify fixed values of $Z_i$ even on parts of the domain boundary $\Gamma_i^{\mathcal N}$ where the Neumann-type condition is used to specify inflow, see \cite{fucik:2019NumDwarf,klinkovsky:2017thesis} for details.

		\subsection{Numerical scheme}
		\label{sec:mhfem:numerical scheme}

		\inline{Dát sem celé odvození schématu z diplomky \cite{klinkovsky:2017thesis} a paperu \cite{fucik:2019NumDwarf}?}

		\emph{NumDwarf} is based on the combination of the mixed-hybrid finite element method (MHFEM) and the discontinuous Galerkin method for spatial discretization, the Euler method for temporal discretization and the semi-implicit approach of the frozen coefficients method for the linearization in time.
		The scheme is stabilized with upwind and mass-lumping techniques and a modification of MHFEM described in \cite{fucik:2019NumDwarf} can be employed to solve problems with vanishing diffusion.
		@@ -26,7 +50,19 @@ The efficiently parallelized algorithm is locally conservative and leads to a li
		In this work, we use conforming unstructured homogeneous meshes denoted as $\mathcal K_h$ for the discretization of the domain and the lowest-order Raviart--Thomas--Nédélec function space $\mathbf{RTN}_0(\mathcal K_h)$.
		The set of faces of the mesh $\mathcal K_h$ will be denoted as $\mathcal E_h$ and the set of faces of an individual element $K \in \mathcal K_h$ will be denoted as $\mathcal E_K$.

		\subsubsection{Benchmarking methodology}
		\subsection{Generalized McWhorter--Sunada problem}

		In this section, we present the results of the generalized McWhorter--Sunada problem, a special case of two-phase flow in porous media with known semi-analytical solution \cite{mcwhorter1990exact,fucik2016multidimensional} which was used in \cite{fucik:2019NumDwarf} to verify the numerical scheme.

		\inline{problem description}

		\subsubsection{Verification results}

		For the purpose of this thesis, the computational domain, boundary conditions, and physical parameters were chosen identically to \cite{fucik:2019NumDwarf}, and also the meshes listed in \cref{tab:meshes} are the same as the triangular and tetrahedral meshes used for CPU computations in the original paper.

		\inline{add EOC tables}

		\subsection{Benchmarking methodology}

		\inline{same as \cref{sec:meshes:benchmarking methodology}, but includes MPI and $E\!f\!f$}

		@@ -39,22 +75,16 @@ In case of computations on the GPU, introducing a quantity similar to $E\!f\!f$
		Hence, the only comparable quantity is the GPU speed-up $GSp_\ell$ defined as the ratio between the computational times on CPU using $\ell$ cores and GPU.
		Similarly, parallel computations using multiple GPUs can be compared by means of the speed-up $Sp_\ell$ defined as the ratio between computational times using 1 and $\ell$ GPUs.

		\subsubsection{Generalized McWhorter--Sunada problem}

		\inline{rework the section for the thesis: rename to performance, add EOC tables in a different section}

		In order to demonstrate that the data structure is applicable to advanced numerical schemes, we implemented a scheme based on the mixed-hybrid finite element method \cite{fucik:2019NumDwarf} to solve the two-phase flow in porous media.
		In this section, we present the results of the generalized McWhorter--Sunada problem, a special case of two-phase flow in porous media with known semi-analytical solution \cite{mcwhorter1990exact,fucik2016multidimensional} which has been used in \cite{fucik:2019NumDwarf} to verify the numerical scheme.
		%The GPU computations in the original paper \cite{fucik:2019NumDwarf} were performed only on structured grids and this section extends the results also to GPU computations on unstructured meshes.
		%Hence, ...
		For the purpose of this paper, the computational domain, boundary conditions, and physical parameters were chosen identically to \cite{fucik:2019NumDwarf}, and also the meshes listed in \cref{tab:meshes} are the same as the triangular and tetrahedral meshes used for CPU computations in the original paper.

		Unlike the previous benchmark problems, the computational time of the generalized McWhorter--Sunada problem is governed by the resolution of sparse linear systems rather than operations involving the unstructured mesh.
		The computational time of the generalized McWhorter--Sunada problem is governed by the resolution of sparse linear systems rather than operations involving the unstructured mesh.
		The sparse linear systems are resolved by the BiCGstab solver with the Jacobi preconditioner.
		The solver can be executed either on a CPU or on a GPU.
		The CPU execution can use either OpenMP (shared memory, thread-based) or MPI (distributed memory, process-based) for parallelization.
		The GPU execution uses CUDA \cite{nvidia:cuda} for parallelization on a single GPU and it can be combined with MPI \cite{kraus:cuda-aware-mpi} for multi-GPU execution.
		Only results using MPI and/or CUDA for parallelization are included in this benchmark.

		\subsection{Benchmark results}

		The original paper \cite{fucik:2019NumDwarf} includes a performance study on structured grids and unstructured meshes using only CUDA and OpenMP parallelization (i.e., without MPI).
		The paper \cite{klinkovsky:2022meshes} contains an updated performance study on unstructured meshes using MPI and/or CUDA for parallelization across multiple CPU cores, multiple GPUs, or even multiple nodes.

		The computational times $CT$, parallel efficiency $E\!f\!f$ and GPU speed-ups $GSp_\ell$ for this benchmark are shown in \cref{tab:comptimes:mcwh2d_mpi,tab:comptimes:mcwh3d_mpi}.
		The GPU speed-ups $GSp_\ell$ rise as the meshes are refined and on the finest meshes the GPU computations are more than 20 times (2D) or 50 times (3D) faster than the sequential computations and more than 2 times (2D) or 9 times (3D) faster than CPU computations using 12 cores.
		@@ -62,28 +92,16 @@ When the CPU execution goes from 6 to 12 cores on the finest 2D mesh in \cref{ta
		This phenomenon may be caused by an increased cache size, since each core has its own L2 cache of size 1 MiB (see \cref{tab:hardware}) that cannot be utilized by other cores.
		In case of problems with memory requirements comparable to the cache size, such as the mentioned computation on the mesh 2D$^\triangle_5$, increasing the number of cores may improve performance more than proportionally to the number of cores, since more data may be readily available in the cache and accesses to the system memory may be avoided.

		% \begin{table}[bth]
		% \caption{
		% Comparison of computational times $CT \, [\text{s}]$, parallel efficiency $E\!f\!f$ of the OpenMP-based CPU computations, and GPU speed-up $GSp_\ell$ for the generalized McWhorter--Sunada problem in 2D.
		% }
		% \label{tab:comptimes:mcwh2d_omp}
		% \centering
		% \scalebox{0.84}{
		% \input{./data/mcwhdd/comptimes_openmp_2D.tex}
		% }
		% \end{table}
		%
		% \begin{table}[bth]
		% \caption{
		% Comparison of computational times $CT \, [\text{s}]$, parallel efficiency $E\!f\!f$ of the OpenMP-based CPU computations, and GPU speed-up $GSp_\ell$ for the generalized McWhorter--Sunada problem in 3D.
		% Computations on the mesh 3D$^\triangle_5$ were not performed for 1 and 2 CPU threads due to computational time limit imposed on the system.
		% }
		% \label{tab:comptimes:mcwh3d_omp}
		% \centering
		% \scalebox{0.84}{
		% \input{./data/mcwhdd/comptimes_openmp_3D.tex}
		% }
		% \end{table}
		\begin{table}[!tb]
		\caption{
		Comparison of computational times $CT \, [\text{s}]$, parallel efficiency $E\!f\!f$ of the OpenMP-based CPU computations, and GPU speed-up $GSp_\ell$ for the generalized McWhorter--Sunada problem in 2D.
		}
		\label{tab:comptimes:mcwh2d_omp}
		\centering
		\scalebox{0.84}{
		\input{./data/mcwhdd/comptimes_openmp_2D.tex}
		}
		\end{table}

		\begin{table}[!tb]
		\caption{
		@@ -96,6 +114,18 @@ In case of problems with memory requirements comparable to the cache size, such
		}
		\end{table}

		\begin{table}[!tb]
		\caption{
		Comparison of computational times $CT \, [\text{s}]$, parallel efficiency $E\!f\!f$ of the OpenMP-based CPU computations, and GPU speed-up $GSp_\ell$ for the generalized McWhorter--Sunada problem in 3D.
		Computations on the mesh 3D$^\triangle_5$ were not performed for 1 and 2 CPU threads due to computational time limit imposed on the system.
		}
		\label{tab:comptimes:mcwh3d_omp}
		\centering
		\scalebox{0.84}{
		\input{./data/mcwhdd/comptimes_openmp_3D.tex}
		}
		\end{table}

		\begin{table}[!tb]
		\caption{
		Comparison of computational times $CT \, [\text{s}]$, parallel efficiency $E\!f\!f$ of the MPI-based CPU computations, and GPU speed-up $GSp_\ell$ for the generalized McWhorter--Sunada problem in 3D.
		@@ -118,13 +148,6 @@ This is caused by the strong-scaling approach used in the benchmark: each comput
		%It can be expected that even higher speed-ups would be achieved for a problem on a finer mesh with more than $10^7$ degrees of freedom.
		The speed-ups could be improved by optimizing the linear system solver (BiCGstab) for multi-GPU computations, but this is out of scope of this paper.

		CPU computations distributed across multiple nodes using MPI are compared in \cref{tab:comptimes:mcwh3d_cpu_nodes}.
		This strong-scaling performance study was performed only on the finest tetrahedral mesh 3D$^\triangle_5$.
		The computational cluster used for the computations consists of 20 dual-processor nodes containing CPUs listed in \cref{tab:hardware}, but only 16 nodes at most could be employed in one MPI computation.
		The speed-up $Sp$ is calculated with respect to the computation using 12 cores, which was included in \cref{tab:comptimes:mcwh3d_mpi} already.
		It can be noticed that the speed-up is often higher than the number of CPUs, which may be again attributed to the increasing total cache size and the problem size staying constant in the strong-scaling study.
		Comparing the computational times on the mesh 3D$^\triangle_5$ from \cref{tab:comptimes:mcwh3d_gpu} with \cref{tab:comptimes:mcwh3d_cpu_nodes}, it can be seen that using 1 GPU leads to a faster computational time than when using 8 CPUs (4 nodes) and at least 32 CPUs (16 nodes) are necessary for a faster computational time than when using 4 GPUs.

		\begin{table}[tb]
		\caption{
		Comparison of computational times $CT \, [\text{s}]$ and speed-up $Sp_\ell$ of MPI-based GPU computations for the generalized McWhorter--Sunada problem in 2D.
		@@ -149,6 +172,13 @@ Comparing the computational times on the mesh 3D$^\triangle_5$ from \cref{tab:co
		}
		\end{table}

		CPU computations distributed across multiple nodes using MPI are compared in \cref{tab:comptimes:mcwh3d_cpu_nodes}.
		This strong-scaling performance study was performed only on the finest tetrahedral mesh 3D$^\triangle_5$.
		The computational cluster used for the computations consists of 20 dual-processor nodes containing CPUs listed in \cref{tab:hardware}, but only 16 nodes at most could be employed in one MPI computation.
		The speed-up $Sp$ is calculated with respect to the computation using 12 cores, which was included in \cref{tab:comptimes:mcwh3d_mpi} already.
		It can be noticed that the speed-up is often higher than the number of CPUs, which may be again attributed to the increasing total cache size and the problem size staying constant in the strong-scaling study.
		Comparing the computational times on the mesh 3D$^\triangle_5$ from \cref{tab:comptimes:mcwh3d_gpu} with \cref{tab:comptimes:mcwh3d_cpu_nodes}, it can be seen that using 1 GPU leads to a faster computational time than when using 8 CPUs (4 nodes) and at least 32 CPUs (16 nodes) are necessary for a faster computational time than when using 4 GPUs.

		\begin{table}[tb]
		\caption{
		Comparison of computational times $CT \, [\text{s}]$ of distributed MPI-based CPU computations for the generalized McWhorter--Sunada problem on the finest tetrahedral mesh 3D$^\triangle_5$.