The content of this section is based on the author's previous thesis \cite{klinkovsky:2017thesis} and the paper \cite{fucik:2019NumDwarf}, which describe \emph{NumDwarf}, a numerical scheme for the solution of a system of partial differential equations in a general coefficient form.
\emph{NumDwarf} is based on the mixed-hybrid finite element method (MHFEM) \cite{brezzi:1991mixed} and it was originally developed for simulating multicomponent transport phenomena in porous media (see \cite{solovsky2020,askar2021} for several recent applications), but it can be used for any problem whose governing equations can be written in a compatible form.
Although the implementation of the numerical scheme has seen only minor modifications since the publication of the paper, the computational aspects of the author's solver have been greatly extended.
Most notably, the solver supports MPI computations on distributed unstructured meshes.
The MPI implementation is based on TNL and supports both multi-GPU and multi-CPU computations, and solvers for sparse linear systems from the Hypre library are available through TNL wrappers.
Compared to the original computational performance study from \cite{fucik:2019NumDwarf}, which was conducted on structured grids as well as unstructured meshes, but using only CUDA and OpenMP parallelization, the results presented in this work were computed only on unstructured meshes and include MPI-based distributed computations using multiple GPUs and multiple CPU nodes.
This also extends the updated performance study from \cite{klinkovsky:2022meshes} with a comparison between OpenMP and MPI for multi-core CPU parallelization on a single node, and a comparison between different solvers for sparse linear systems.

The section is organized as follows.
First, we introduce the general formulation of the PDE system solved by \emph{NumDwarf} in \cref{sec:mhfem:formulation} and summarize the numerical scheme in \cref{sec:mhfem:numerical scheme}.
In \cref{sec:mhfem:mcwhdd}, we present the results of a benchmark problem that was used to verify the numerical scheme and implementation by means of comparison against known semi-analytical solutions.
The same benchmark problem was also used to evaluate the performance of the solver with various computational parameters.
The benchmarking methodology is described in \cref{sec:mhfem:benchmarking methodology} and the benchmark results are presented in \cref{sec:mhfem:computational results}.
\subsection{General formulation}
\label{sec:mhfem:formulation}
The numerical scheme is implemented for a PDE system written in the form
\begin{equation}\label{eq:NumDwarf}
	\sum_{j=1}^{n} N_{i,j} \frac{\partial Z_j}{\partial t}
	+ \nabla \cdot \bigg( \sum_{j=1}^{n} \vec u_{i,j} Z_j
	+ m_i \Big( {-} \sum_{j=1}^{n} \tensor D_{i,j} \nabla Z_j + \vec w_i \Big) \bigg)
	+ \sum_{j=1}^{n} \vec a_{i,j} \cdot \nabla Z_j
	+ \sum_{j=1}^{n} r_{i,j} Z_j
	= f_i,
\end{equation}
for $i =1, \ldots, n$, where $\vec Z =[Z_1, \ldots, Z_n]^T$ is the vector of unknown functions depending on spatial coordinates $\vec x \in\Omega\subset\mathbb R^D$ and time $t \in[0, t_{\max}]$, where $D \in\{1, 2, 3\}$ denotes the spatial dimension, $\Omega$ is a polyhedral domain, and $t_{\max}$ is the final simulation time.
The remaining symbols $\matrix N =[N_{i,j}]_{i,j=1}^n$, $\vec u =[\vec u_{i,j}]_{i,j=1}^n$, $\vec m =[m_i]_{i=1}^n$, $\matrix D =[\tensor D_{i,j}]_{i,j=1}^n$, $\vec w =[\vec w_i]_{i=1}^n$, $\vec a =[\vec a_{i,j}]_{i,j=1}^n$, $\vec r =[r_{i,j}]_{i,j=1}^n$, $\vec f =[f_i]_{i=1}^n$ are given problem-specific coefficients.
Note that the coefficients may, in general, depend on the vector of unknown functions $\vec Z$.
The idea behind the notation is that coefficients dependent on $\vec Z$ in non-linear terms are evaluated by the numerical scheme using the values from the current time level, whereas the symbols $Z_j$ in \cref{eq:NumDwarf} are treated implicitly.
See \cite{fucik:2019NumDwarf,klinkovsky:2017thesis} for details.
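As a schematic scalar illustration of this semi-implicit treatment (not the full scheme, which is given in \cite{fucik:2019NumDwarf}), a nonlinear equation of the form $N(Z)\,\partial_t Z + \nabla \cdot (\vec u(Z)\, Z) = f(Z)$ is discretized in time as
\begin{equation*}
	N\big(Z^{k}\big) \, \frac{Z^{k+1} - Z^{k}}{\Delta t} + \nabla \cdot \Big( \vec u\big(Z^{k}\big) \, Z^{k+1} \Big) = f\big(Z^{k}\big),
\end{equation*}
where the superscripts denote the time level and $\Delta t$ is the time step: the coefficients are frozen at the current level $k$, while the unknown appears linearly at the new level $k+1$, so each time step leads to a linear problem.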
The system of equations~\eqref{eq:NumDwarf} must be supplemented by a suitable initial condition
\begin{equation}
	\vec Z(\vec x, 0) = \vec Z_{\mathrm{ini}}(\vec x), \qquad \vec x \in \Omega,
\end{equation}
where $\vec Z_{\mathrm{ini}}$ is a given function, and by boundary conditions prescribed on $\partial\Omega$.
\subsection{Numerical scheme}
\label{sec:mhfem:numerical scheme}
Since the formulation of the numerical scheme has seen no modifications since the works \cite{klinkovsky:2017thesis,fucik:2019NumDwarf}, the full derivation of the numerical scheme is omitted in this thesis and readers are referred to the aforementioned publications for details.
In the following paragraphs, we briefly summarize the numerical approach for the purpose of applications presented in this work.
\emph{NumDwarf} is based on the combination of the mixed-hybrid finite element method (MHFEM) \cite{brezzi:1991mixed} and the discontinuous Galerkin method \cite{arnold:2002} for spatial discretization, the Euler method for temporal discretization, and the semi-implicit approach of the frozen coefficients method for the linearization in time.
The scheme is stabilized with upwind and mass-lumping techniques and a modification of MHFEM described in \cite{fucik:2019NumDwarf} can be employed to solve problems with vanishing diffusion.
The scheme is locally conservative and leads to the solution of one linear system per time step.
The matrix of each linear system is generally non-symmetric due to the upwind stabilization.
In this work, we use conforming unstructured homogeneous meshes denoted as $\mathcal K_h$ for the discretization of the domain and the lowest-order Raviart--Thomas--Nédélec function space $\mathbf{RTN}_0(\mathcal K_h)$.
The set of faces of the mesh $\mathcal K_h$ will be denoted as $\mathcal E_h$ and the set of faces of an individual element $K \in\mathcal K_h$ will be denoted as $\mathcal E_K$.
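For reference, on a simplicial element $K \in \mathcal K_h$ the local space is
\begin{equation*}
	\mathbf{RTN}_0(K) = \left\{ \vec v \colon \vec v(\vec x) = \vec a + b \vec x, \ \vec a \in \mathbb R^D, \ b \in \mathbb R \right\},
\end{equation*}
whose functions have a constant normal component on each face $E \in \mathcal E_K$; hence, a function from $\mathbf{RTN}_0(\mathcal K_h)$ is determined by one degree of freedom per face of the mesh.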
\subsection{Generalized McWhorter--Sunada problem}
\label{sec:mhfem:mcwhdd}

In this section, we present the results of the generalized McWhorter--Sunada problem, a special case of two-phase flow in porous media with a known semi-analytical solution \cite{mcwhorter1990exact,fucik2016multidimensional}, which was used in \cite{fucik:2019NumDwarf} to verify the numerical scheme.
All simulations were computed in \ic{double} precision.
\end{table}
\subsection{Benchmarking methodology}
\label{sec:mhfem:benchmarking methodology}
The generalized McWhorter--Sunada problem was computed as a benchmark for several configurations of the solver.
All configurations were tested on the triangular as well as tetrahedral meshes with the same problem parameters as in the verification study described in the previous subsection.
The following parameters were set specifically for GPU computations:
There are many other configurable parameters listed in the Hypre documentation \cite{Hypre:documentation} that might have an effect on the performance, but they were not investigated in this benchmark.
To reduce the computational time with the Hypre implementation, the BoomerAMG preconditioner is not updated in every time step; instead, its setup is reused for multiple consecutive time steps as long as the number of iterations necessary for the BiCGstab method to converge does not increase significantly.
Specifically, an increase of 5 iterations since the last preconditioner update triggers a new preconditioner setup.
Reusing preconditioners and recycling Krylov subspaces are common techniques for solving sequences of linear systems \cite{carr2021recycling,amritkar2015recycling,gaul2014recycling}, such as those arising from discretizations of evolutionary partial differential equations.
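The reuse heuristic can be sketched as follows; this is an illustrative simplification with hypothetical callbacks (\ic{assemble}, \ic{setup\_preconditioner}, \ic{solve}), not the actual solver implementation:

```python
def run_time_steps(assemble, setup_preconditioner, solve, num_steps, threshold=5):
    """Reuse an expensive preconditioner setup across time steps until the
    iterative solver needs `threshold` more iterations than it did right
    after the last setup. All callbacks are hypothetical placeholders."""
    precond = None
    base_iters = None  # iteration count right after the last setup
    num_setups = 0
    for step in range(num_steps):
        system = assemble(step)  # linear system for this time step
        if precond is None:
            precond = setup_preconditioner(system)
            num_setups += 1
            base_iters = None
        _, iters = solve(system, precond)
        if base_iters is None:
            base_iters = iters
        elif iters - base_iters >= threshold:
            # convergence degraded: rebuild the preconditioner next step
            precond = None
    return num_setups
```

For example, under a model in which the iteration count grows by one for every time step the preconditioner is reused, a new setup is triggered every six steps.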
The computations were performed on the same hardware as described in \cref{sec:meshes:benchmarking methodology}, and also the software versions and compiler parameters were the same.
The secondary quantities of interest for the evaluation of parallel computations are the speed-up $S$ and the parallel efficiency $E\!f\!f$, defined as
\begin{equation*}
	S = \frac{CT_1}{CT_p},
	\qquad
	E\!f\!f = \frac{S}{p},
\end{equation*}
where $CT_p$ denotes the computational time $CT$ measured with $p$ parallel processing units.
In case of GPU computations, the speed-up and efficiency are calculated relative to the number of whole GPUs rather than the number of individual GPU cores, because the GPU cores are not independent.
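For concreteness, the speed-up and parallel efficiency can be computed from measured times as in the following minimal sketch (the timings here are made up, not benchmark data):

```python
def speedup_and_efficiency(ct_serial, ct_parallel, units):
    """Speed-up S = CT_1 / CT_p and parallel efficiency Eff = S / p,
    where p counts whole GPUs for GPU runs and CPU cores otherwise."""
    speedup = ct_serial / ct_parallel
    return speedup, speedup / units

# e.g. 100 s serially vs 30 s on 4 cores
s, eff = speedup_and_efficiency(100.0, 30.0, 4)
```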
\subsection{Computational benchmark results}
\label{sec:mhfem:computational results}
The benchmark results are presented as strong-scaling studies for the finest triangular mesh 2D$^\triangle_5$ and finest tetrahedral mesh 3D$^\triangle_5$.
\Cref{tab:mhfem:comptimes:CPU 2D} shows the CPU computational times $CT$ and parallel efficiency $E\!f\!f$ for the 2D problem.
@article{arnold:2002,
	author   = {Arnold, Douglas N. and Brezzi, Franco and Cockburn, Bernardo and Marini, L. Donatella},
	title    = {Unified analysis of discontinuous {Galerkin} methods for elliptic problems},
	journal  = {SIAM Journal on Numerical Analysis},
	year     = {2002},
	volume   = {39},
	number   = {5},
	pages    = {1749--1779},
	abstract = {We provide a framework for the analysis of a large class of discontinuous methods for second-order elliptic problems. It allows for the understanding and comparison of most of the discontinuous Galerkin methods that have been proposed over the past three decades for the numerical treatment of elliptic problems.},
}