The velocity field $\vec v(\vec x, t)$ is first decomposed as
\begin{equation}
	\vec v(\vec x, t) = \overline{\vec v}(\vec x) + \vec v'(\vec x, t),
\end{equation}
where $\overline{\vec v}(\vec x)$ is the mean (time-averaged) field and $\vec v'(\vec x, t)$ is the velocity fluctuation.
The mean field $\overline{\vec v}$ can be set as desired for the initial or boundary condition, while the fluctuation $\vec v'$ must be generated.
The algorithm described below is implemented in the \C++ language with CUDA acceleration and available as part of the Template Numerical Library that was introduced in \cref{sec:TNL}.
Efficient data structures play an important role in high-performance computing, because they determine where data are stored in the computer memory and how quickly they can be accessed.
Hence, data structures and algorithms often have to be designed together in order to utilize the most efficient access pattern on a given hardware architecture.
In this chapter, we present several data structures implemented in the Template Numerical Library that was introduced in \cref{sec:TNL}, which are flexible in the sense that they can be configured for a specific algorithm or hardware architecture.
The software listed here can be divided into two main groups: the first containing smaller dedicated libraries and the second containing larger packages with their own native algorithms.
Packages from the latter group can often be interfaced with smaller libraries from the former group to supplement their native algorithms.
In both groups, we focus on open-source software with active development status.
Each package is listed with a brief description and overview of the main features.
In the final subsection, we summarize the current status in the Template Numerical Library that was introduced in \cref{sec:TNL}.
\subsection{Dedicated projects}
DUNE (Distributed and Unified Numerics Environment) \cite{bastian:2006DUNE} is a modular toolbox for solving partial differential equations with grid-based methods.
Its ISTL (Iterative Solver Template Library) module provides implementations of several Krylov subspace solvers and preconditioners, including algebraic multigrid.
\subsection{Template Numerical Library}
\label{sec:iterative methods:TNL}
In this subsection, we summarize the current status related to the solution of linear systems in the TNL library that was introduced in \cref{sec:TNL}.
Unless stated otherwise, all implementations support any matrix format implemented in TNL, distributed computing with MPI \cite{mpi:3.1}, and GPU acceleration using CUDA \cite{nvidia:cuda}.
TNL does not implement any direct method for the solution of large, dense or sparse, linear systems.
Solving a linear system on a distributed computing platform requires a suitable partitioning of the data among the ranks.
The system is typically distributed in a row-wise manner such that the rows of the matrix and vectors are partitioned into non-overlapping ranges that are assigned to individual ranks.
A \emph{distributed data structure} is then used for combining the local data of each rank with information about the partitioning (e.g., assigning global indices).
While the implementation of a distributed vector is straightforward, data structures for a distributed sparse matrix, which must provide the coupling between blocks of the partitioning, may be designed in different ways.
This section describes data structures for distributed sparse matrices implemented in the TNL (see \cref{sec:TNL}) and Hypre \cite{Hypre:library,Hypre:design1,Hypre:design2} libraries.
The implementation of a distributed sparse matrix in TNL is closely bound to the distributed mesh and its MPI synchronizer described in \cref{sec:meshes:distributed}.
Each row of the global matrix corresponds to one \emph{degree of freedom} associated with an entity of the global mesh and the matrix partitioning is determined by the partitioning of the mesh.
The data owned by a particular rank is stored in a single local matrix represented by a sparse matrix data structure in TNL.
Each row and column of the local matrix corresponds to an entity of the local mesh:
For example, the faces of a simplex are numbered such that each face has the same local index as its opposite vertex.
\label{appendix:mesh layers}
The inheritance diagram for the \ic{Mesh} class template is shown in \cref{fig:inheritance_1}.
The template parameters \ic{Config}, which represents the mesh configuration (see \subsubsecref{sec:meshes:configuration}), and \ic{Device}, which represents the \emph{device} determining the memory space in which the mesh will be allocated (see \cref{sec:TNL:parallel concepts}), are propagated to all base class templates in the hierarchy.
The first base class template is \ic{StorageLayerFamily}, which does not have any data members; its purpose is to provide access, via tag dispatching, to all individual storage layers.
The recursive inheritance is realized in the \ic{StorageLayer} class template which is parametrized by the entity dimension tag, \ic{DimTag<d>}, and comprises all data related to $d$-dimensional mesh entities.
The \ic{StorageLayerFamily} class template starts the inheritance at dimension 0 and each storage layer inherits from the same class template with an incremented dimension tag.
The content of this section is based on the paper~\cite{klinkovsky:2022meshes} that presents a data structure for \emph{conforming unstructured homogeneous} meshes implemented in the TNL library that was introduced in \cref{sec:TNL}.
The initial implementation was designed by Vítězslav Žabka and later it was generalized and improved by the author.
Since the publication of the paper, we have extended the data structure to \emph{polygonal} and \emph{polyhedral} meshes.
The implementation of this extension in TNL was conducted by Ján Bobot as part of his Master's Degree project~\cite{bobot:2022} under the supervision of the author.
The extensions for polygonal and polyhedral meshes are incorporated into the text in this section.
Additionally, this thesis includes modifications of the data structure necessary to support polygonal and polyhedral meshes.
We also describe an extension of the base data structure for \emph{distributed} meshes using the domain decomposition approach.
On the other hand, as a first step towards an efficient implementation, we disregard some features, such as adaptive refinement, which might be challenging to implement efficiently on GPUs.
The proposed data structure can represent essentially arbitrary \emph{shapes} or \emph{topologies} of mesh entities, including a general polygonal or polyhedral entity.
Each topology is represented by its own type in the \C++ language.
The library currently features definitions of the following topologies: vertex, edge, triangle, quadrangle, tetrahedron, hexahedron, simplex of arbitrary dimension, pyramid, wedge, general polygon, and general polyhedron.
Furthermore, the data structure is highly configurable via templates of the \C++ language, which makes it possible to avoid the storage of unnecessary dynamic data.
Similarly to \cite{zayer2017gpu}, the internal memory layout is based on state-of-the-art sparse matrix formats \cite{monakov:2010sparse,oberhuber:2011rgcsr,oberhuber:2012improved} for the storage of incidence matrices.
This approach reduces the complexity of the implementation thanks to code reuse and allows us to select a sparse matrix format that is optimized for a given hardware architecture.
The efficiency of the implemented data structure is verified using several benchmark problems.
We developed two benchmark problems based on simple parallel algorithms to evaluate different configurations of the data structure, to compare it with MOAB~\cite{tautges:2004moab}, and to demonstrate the efficiency of CUDA-based GPU parallelization compared to OpenMP-based CPU parallelization.
The data structure for the representation of a conforming unstructured mesh is implemented in the TNL library.
Unfortunately, it is practically impossible to design a single efficient representation suitable for all numerical schemes.
Hence, compromises had to be made, and this section summarizes our design goals and the techniques we used to achieve them.
This section covers only the important ideas behind the implementation of the \ic{Mesh} class template in the TNL library.
The general design decisions related to parallel programming are described in \cref{sec:TNL}.
The full public interface as well as examples showing how to use the data structure can be found in the TNL project's documentation \cite{TNL:documentation}.
\subsubsection{Static configuration}
\label{sec:meshes:configuration}
In accord with the programming paradigm used in TNL, we aim to satisfy the high configurability requirement with \emph{static configuration}, i.e. configuration which is supplied and resolved at compile-time.
An obvious disadvantage is that if a run-time selection of the configuration is needed, it is restricted to a finite set of configurations which were explicitly instantiated at compile-time.
The user configuration is supplied as a template parameter to the main class template \ic{Mesh} which will be described in the next section.
Unlike some other libraries such as OpenMesh \cite{bischoff:2002openmesh}, where user data can be attached directly to the mesh entities, the mesh data structure in TNL does not manage any user data.
Mesh entities can be identified by their \C++ type (i.e., topology) and global index, and adjacent entities are accessible using local indices.
This information can be used to define a natural mapping between the mesh entities and an appropriate data container storing user data separately from the mesh.
Depending on the specific application, a suitable container might be e.g. a (one-dimensional) array, a multi-dimensional array, an array of structures or a structure of arrays.
As described in \cref{sec:TNL}, TNL provides implementations of one-dimensional and general multi-dimensional arrays (\ic{Array} and \ic{NDArray}) including full GPU support.
The storage layout of the latter is configurable by a permutation of indices, which makes it easy to switch the memory layout (see \cref{sec:ndarray} for details).
\subsubsection{Orientations of mesh entities}
The high-level documentation for programmers, including code examples, can be found in the TNL documentation \cite{TNL:documentation}.
\label{sec:impl:mesh initialization}
Each mesh object, i.e., instance of the \ic{Mesh<Config, Device>} type, has to be initialized prior to its first use.
The former class template parameter, \ic{Config}, refers to a specific configuration class (see \subsubsecref{sec:meshes:configuration}) and the latter, \ic{Device}, denotes the \emph{device} determining the memory space in which the mesh will be allocated (see \cref{sec:TNL:parallel concepts}).
In our implementation, the initialization process is sequential and done entirely on the CPU; the complete mesh can then be transferred to the GPU using an overloaded assignment operator.
The initialization is done via a public member function \ic{init} of the \ic{Mesh} class template, which is usable only when \ic{TNL::Devices::Host} is specified as the class template parameter \ic{Device}.
Additional compiler flags for \ic{nvcc} were
% FIXME bug in the \ic macro, "--" is output as "-"
{\smaller\texttt{-}}\ic{-expt-relaxed-constexpr} and {\smaller\texttt{-}}\ic{-expt-extended-lambda}, and the GPU architecture was set to \ic{sm_70}.
\begin{table}[b]
\caption{Hardware used for the computation of benchmarks \cite{intel:2017XeonGold,nvidia:2017V100}. The cluster is operated by the \emph{Research Center for Informatics} (\url{http://rci.cvut.cz/}).}
\label{tab:hardware}
\centering
\input{./data/hardware_table_rci_v100.tex}
\end{table}
We compared computational times $CT$ for sequential (single-core) computations on the CPU, parallel multi-core computations on the CPU using OpenMP~\cite{OpenMP}, and parallel computations on one GPU accelerator using CUDA~\cite{nvidia:cuda} parallelization.
Note that the distributed mesh is not tested in this section.
To obtain statistically significant results, the computational time was taken as the average of 100 iterations of each computation.
Additionally, polygonal and polyhedral meshes were created from the aforementioned triangular and tetrahedral meshes as their duals.
The duality means that cells of the dual mesh correspond to vertices of the primal mesh and vertices of the dual mesh correspond to cells of the primal mesh.
The properties of the dual meshes are summarized in \cref{tab:meshes poly}.
\begin{table}[bt]
\caption{Properties of the triangular (2D$^\triangle_\iota$) and tetrahedral (3D$^\triangle_\iota$) meshes used in the benchmarks: mesh size $h$ (radius of the largest sphere circumscribed around a cell), number of vertices $\#\mathcal V$, number of faces $\#\mathcal F$, and number of cells $\#\mathcal C$.}
\label{tab:meshes}
The benchmarks were performed on all meshes shown in \cref{tab:meshes}, but, for brevity, only selected results are presented.
Both benchmarks were performed with different configurations of the \ic{Mesh} class in TNL using different data types for template parameters: \ic{float} (\ic{f}) and \ic{double} (\ic{d}) for \ic{Real}, \ic{int} (\ic{i}) and \ic{long int} (\ic{l}) for \ic{GlobalIndex}, and \ic{short int} (\ic{s}) and \ic{int} (\ic{i}) for \ic{LocalIndex}.
For brevity, only four representative sets of types are shown in the tables.
In case of MOAB~\cite{tautges:2004moab}, we used only one set of types (\ic{double} for vertex coordinates, \ic{long int} for global indices and \ic{int} for counts of adjacent entities), which was specified when installing MOAB.
It can be seen in \cref{tab:meshes:comptimes:measures,tab:meshes:comptimes:boundary measures} that selecting the smallest possible types for the \ic{Real}, \ic{GlobalIndex}, and \ic{LocalIndex} template parameters generally leads to faster computations on both CPU and GPU compared to the larger types.
Finally, we would like to remark that the performance of all benchmarks run with the TNL library does not depend on other mesh configuration parameters, such as \ic{subentityStorage} or \ic{superentityStorage}, which can be used to disable the storage of unnecessary incidence matrices.
These configurations are not included in \cref{tab:meshes:comptimes:measures,tab:meshes:comptimes:boundary measures}, because they would lead to duplicate tables with the same values.
Hence, such configuration parameters influence the amount of dynamic memory required for the mesh representation, as shown in \cref{tab:meshes:memory}, but not the performance of computations using the data structure.