Commit 47757446 authored by Jakub Klinkovský

copy-edit

parent 43862e59
+2 −2
@@ -12,7 +12,7 @@ Based on the kinetic theory \cite{kruger:2017lattice}, the density distribution
where $\rho$~[\si{\kilogram\per\cubic\metre}] is the mass density, $\vec F = [F_x, F_y, F_z]^T$~[\si{\newton\per\cubic\metre}] represents the external body force density acting on the mass and $\mathcal C$~[\si{\kilogram\second\squared\per\metre\tothe{6}}] denotes a general collision operator that depends on the distribution function and other parameters (e.g., relaxation times and equilibrium distribution).
LBM follows from the discretization of \cref{eq:lbm:BTE} in space, velocity space, and time.
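For orientation, the discretization produces an update rule of the well-known stream-and-collide form; as a concrete example (the BGK single-relaxation-time scheme of \cite{kruger:2017lattice}, not the general operator $\mathcal C$ above), the update reads
\[
    f_i(\vec x + \vec c_i \Delta t, t + \Delta t) = f_i(\vec x, t) - \frac{\Delta t}{\tau} \left[ f_i(\vec x, t) - f_i^{\mathrm{eq}}(\vec x, t) \right],
\]
where $f_i$ are the distribution functions associated with the discrete lattice velocities $\vec c_i$, $\tau$ is the relaxation time and $f_i^{\mathrm{eq}}$ denotes the equilibrium distribution.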

-\section{Components of the lattice Boltzmann method}
+\section{Components of the method}

The detailed derivation of LBM is not within the scope of this thesis.
Instead, we review the components needed to describe the implementation of the method and refer the reader to \cite{kruger:2017lattice} for further details.
@@ -664,7 +664,7 @@ The MPI implementation was OpenMPI version 4.1.2.
All software components were loaded using the environment module \ic{OpenMPI/4.1.2-NVHPC-22.2-CUDA-11.6.0}.

\begin{table}[bt]
-    \caption{Hardware specifications of the accelerated compute nodes in the Karolina supercomputer \cite{it4i:karolina}, operated by IT4Innovations (\url{https://www.it4i.cz/}).}
+    \caption{Hardware specifications of accelerated compute nodes in the Karolina supercomputer \cite{it4i:karolina}, operated by IT4Innovations (\url{https://www.it4i.cz/}).}
    \label{tab:hardware:Karolina}
    \centering
    \input{./data/hardware_table_Karolina.tex}
+2 −2
@@ -5,8 +5,8 @@
\inline{TODO}
\inline{General motivation for polyhedral meshes in scientific computing -- advantages w.r.t. tetrahedral or hexahedral meshes: some ideas are in \url{https://www.symscape.com/polyhedral-tetrahedral-hexahedral-mesh-comparison}}

-Mathematical modelling of fluid dynamics has many ecological, medical and industrial applications and it is one of the central research topics investigated at the Department of Mathematics, FNSPE CTU in Prague in collaboration with prominent domestic as well as foreign institutions.
-In order to model complex natural processes governing the behaviour of fluids, it is often necessary to run high-resolution simulations on large computational clusters or supercomputers.
+Mathematical modeling of fluid dynamics has many ecological, medical and industrial applications and it is one of the central research topics investigated at the Department of Mathematics, FNSPE CTU in Prague in collaboration with prominent domestic as well as foreign institutions.
+In order to model complex natural processes governing the behavior of fluids, it is often necessary to run high-resolution simulations on large computational clusters or supercomputers.
However, using the facilities for high performance computing efficiently is non-trivial, as it requires careful management of data in the computer memory and appropriate division of the task between all available processing units.
Especially when designing algorithms for systems with GPU accelerators, which provide significantly more processing units as well as memory levels compared to traditional computational systems, the specifics of the hardware architecture have to be considered.

+0 −2
\inline{No supporting paper since the CWY-GMRES paper was discarded...}

This chapter is dedicated to methods for the solution of large and sparse systems of linear equations.
It does not contain any novel work on the methods by the author, but provides an overview of the state of the art.
High performance solvers for sparse linear systems are an important building block for the numerical methods developed in the following chapter.
+14 −14
@@ -185,7 +185,7 @@ It can be observed that $I_{d_2, d_1} = I_{d_1, d_2}^T$.

Additional connectivity properties can be defined for unstructured meshes.
For example, the dual graph $\mathcal D$ can be used to describe the adjacency of cells in a mesh $\mathcal M$.
-The vertices of $\mathcal D$ correspond to the mesh cells $\mathcal E_D \subset \mathcal M$ and edges of the graph $\mathcal D$ represent links between adjacent (or neighbouring) mesh cells.
+The vertices of $\mathcal D$ correspond to the mesh cells $\mathcal E_D \subset \mathcal M$ and edges of the graph $\mathcal D$ represent links between adjacent (or neighboring) mesh cells.
An example of a small mesh and its dual graph is shown in \cref{fig:dual}.
The dual graph is usually constructed such that two $D$-dimensional cells are adjacent if and only if they have a common $(D-1)$-dimensional face, i.e., they share at least $D$ different subvertices.
A more general definition, used e.g. in the METIS library \cite{karypis:1998fast}, is based on specifying the minimum number of common subvertices $n_{common}$ that two cells must share to be considered adjacent.
@@ -238,15 +238,15 @@ The next three definitions specify the type of meshes that will be dealt with in
    If for all mesh entities $E_1, E_2 \in \mathcal M$ the intersection of their closures $\overline E_1 \cap \overline E_2$ is either an empty set or a mesh entity, then $\mathcal M$ is called a conforming mesh.
\end{definition}

-\Cref{def:conforming mesh} implies that a conforming mesh does not contain so called \emph{hanging entities} which would be subentities of only one of the neighbouring cells.
+\Cref{def:conforming mesh} implies that a conforming mesh does not contain so called \emph{hanging entities} which would be subentities of only one of the neighboring cells.

\begin{definition}[unstructured mesh]
    \label{def:unstructured mesh}
    A mesh is called an unstructured mesh if each vertex of the mesh can be a vertex of a non-constant number of cells.
\end{definition}

-\Cref{def:unstructured mesh} implies that the number of superentities of an entity in an unstructured mesh (i.e., the number of neighbouring entities of a higher dimension than the entity itself) is not a~priori constant.
-This is in contrast with the number of subentities of an entity which depends only on the shape of the entity and not on the neighbourhood of the entity.
+\Cref{def:unstructured mesh} implies that the number of superentities of an entity in an unstructured mesh (i.e., the number of neighboring entities of a higher dimension than the entity itself) is not a~priori constant.
+This is in contrast with the number of subentities of an entity which depends only on the shape of the entity and not on the neighborhood of the entity.

\begin{definition}[homogeneous mesh]
    \label{def:homogeneous mesh}
@@ -335,7 +335,7 @@ The configuration allows to change the following parameters:
    Each entity stored in the mesh must store the links to its subvertices which provide their geometric realization.
    See the function \ic{subentityStorage} in \cref{appendix:default config}.
    \item
-    Pairs of dimensions $d_1, d_2$, where $d_1 < d_2$, for which $d_1$-dimensional entities store the links to their neighbouring $d_2$-dimensional superentities.
+    Pairs of dimensions $d_1, d_2$, where $d_1 < d_2$, for which $d_1$-dimensional entities store the links to their neighboring $d_2$-dimensional superentities.
    See the function \ic{superentityStorage} in \cref{appendix:default config}.
    \item
    Dimensions of entities for which the interior/boundary tags are stored.
@@ -433,15 +433,15 @@ Unfortunately, the term \emph{orientation} is commonly used in two different con
        The core implementation detail is that column indices in each row of the incidence matrices are not sorted.
        Thus, for each mesh entity, we can define local indices of its subentities and superentities corresponding to the positions in the rows related to the entity.
        For example, the incidence matrix $I_{D-1,D}$ contains one or two entries per row which correspond to the cells adjacent to the face given by the particular row index.
-        The first entry is assigned a local index 0 and its column index corresponds to the so called \emph{owner cell} of the face, whereas the second entry (if it exists) is assigned a local index 1 and its column index is related to the so called \emph{neighbour cell} of the face.
+        The first entry is assigned a local index 0 and its column index corresponds to the so called \emph{owner cell} of the face, whereas the second entry (if it exists) is assigned a local index 1 and its column index is related to the so called \emph{neighbor cell} of the face.
        The orientation of the face can be inferred from the topology of its owner cell (see \cref{appendix:topologies}), because the owner cell is the one from which the face was initialized.

        When appropriate, the orientation of faces needs to be extracted manually from the data structure using a suitable algorithm such as the following.
-        For example, given a cell denoted by index $K$ and its adjacent face denoted by index $F$, we are interested in whether the cell $K$ is the owner or neighbour cell of the face $F$.
+        For example, given a cell denoted by index $K$ and its adjacent face denoted by index $F$, we are interested in whether the cell $K$ is the owner or neighbor cell of the face $F$.
        First, we can use the templated function \ic{getSuperentitiesCount} to obtain the number of cells adjacent to the face $F$.
        If we get one, it must be the cell $K$ which is also the owner cell.
        Otherwise, we can pass the local index 0 to the templated function \ic{getSuperentityIndex} to obtain the global index $K_0$ of the face's owner cell that can be compared to the index $K$.
-        If $K$ equals $K_0$, then $K$ is the owner cell, otherwise it is the neighbour cell and the orientation of $F$ needs to be determined based on $K_0$.
+        If $K$ equals $K_0$, then $K$ is the owner cell, otherwise it is the neighbor cell and the orientation of $F$ needs to be determined based on $K_0$.

    \item
        For high-order finite element methods, it is often necessary to define local coordinate systems on each cell such that degrees of freedom associated with faces and edges shared by adjacent cells can be located consistently when the face or edge is viewed from each of the adjacent cells.
@@ -612,7 +612,7 @@ Next, it can generate \emph{overlaps} (also called \emph{ghost regions}) across
This results in several layers of vertices and cells close to the subdomain boundary being added to the local mesh (i.e., the mesh representing space discretization of the given subdomain) from other subdomains.
This is needed for communication patterns between MPI ranks where data associated with mesh entities need to be exchanged.
Finally, an appropriate renumbering of mesh entities on each subdomain is generated.
-In each dimension, mesh entities are ordered such that entities owned by the current MPI rank are numbered first, followed by the ghost entities owned by the first neighbour, then ghost entities owned by the second neighbour, and so on.
+In each dimension, mesh entities are ordered such that entities owned by the current MPI rank are numbered first, followed by the ghost entities owned by the first neighbor, then ghost entities owned by the second neighbor, and so on.
This ordering simplifies the implementation and increases the efficiency of data synchronization on ghost regions, which will be described later.

\subsubsection{Distributed mesh implementation}
@@ -635,7 +635,7 @@ Specifically, each entity is either \emph{local} or \emph{ghost}, and from the g
\subsubsection{Data synchronization on ghost regions}

Computations following the domain decomposition approach typically involve synchronization to ensure data consistency on ghost regions.
-This is done by sending data computed by an MPI rank on its subdomain to the ghost regions of its neighbours, and from the opposite side, receiving data computed by neighbours into the ghost regions around the subdomain.
+This is done by sending data computed by an MPI rank on its subdomain to the ghost regions of its neighbors, and from the opposite side, receiving data computed by neighbors into the ghost regions around the subdomain.

The data synchronization process is implemented in the \ic{DistributedMeshSynchronizer} class template as a separate data structure that can be instantiated for a specific distributed mesh and dimension $d$ of entities being synchronized.
The synchronizer can be used to synchronize data arrays allocated anywhere: host-to-host, host-to-GPU, and GPU-to-GPU communication are all supported.
@@ -647,7 +647,7 @@ Then the initialization phase involves the following steps:
\begin{enumerate}
    \item An unsymmetric $P \times P$ communication matrix $G$ is created such that $G_{i,j}$ represents the
    number of $d$-dimensional ghost entities on rank $i$ that are owned by rank $j$.
-    \item Each rank collects the indices of the first entity owned by each neighbour and stores them in an array of size $P$.
+    \item Each rank collects the indices of the first entity owned by each neighbor and stores them in an array of size $P$.
    \item Each rank collects the indices of local entities that are ghosts on a different subdomain.
    This can be represented by a binary sparse matrix with $P$ rows.
\end{enumerate}
@@ -662,7 +662,7 @@ In this case, the remaining entities have to be initialized at run-time via MPI
The algorithm starts by each MPI rank determining and counting all \emph{local} uninitialized entities, exchanging the counts with other MPI ranks, and assigning a global index to all local entities.
All remaining entities whose global index could not be determined in the first phase are located either on the interface between subdomains or in the ghost region, and their global index must be received from the rank which owns the entity.
It can be observed that the uninitialized entities can be unambiguously identified by their subvertices, which already have global indices assigned.
-Hence, neighbouring ranks exchange the tuples of global subvertex indices, identify the requested entities in their local mesh and send back global indices to the neighbours.
+Hence, neighboring ranks exchange the tuples of global subvertex indices, identify the requested entities in their local mesh and send back global indices to the neighbors.
Finally, each rank reorders the newly initialized entities on its local mesh based on the global indices.


@@ -838,7 +838,7 @@ For brevity, only four representative sets of types are shown in the tables.
In case of MOAB~\cite{tautges:2004moab}, we used only one set of types (\ic{double} for vertex coordinates, \ic{long int} for global indices and \ic{int} for counts of adjacent entities), which was specified when installing MOAB.
It can be seen in \cref{tab:meshes:comptimes:measures,tab:meshes:comptimes:boundary measures} that selecting the smallest possible types for the \ic{Real}, \ic{GlobalIndex}, and \ic{LocalIndex} template parameters generally leads to faster computations on both CPU and GPU compared to the larger types.
Finally, we would like to remark that the performance of all benchmarks performed with the TNL library does not depend on other mesh configuration parameters, such as \ic{subentityStorage} or \ic{superentityStorage}, which can be used to disable the storage of unnecessary incidence matrices.
-This behaviour is not included in \cref{tab:meshes:comptimes:measures,tab:meshes:comptimes:boundary measures}, because it would lead to duplicate tables with the same values.
+This behavior is not included in \cref{tab:meshes:comptimes:measures,tab:meshes:comptimes:boundary measures}, because it would lead to duplicate tables with the same values.
Hence, such configuration parameters influence the dynamic size required for the mesh representation as shown in \cref{tab:meshes:memory}, but not the performance of computations using the data structure.

\begin{table}[bt]
@@ -925,7 +925,7 @@ The benchmarks were performed on all meshes shown in \cref{tab:meshes poly}, but

Both benchmarks were performed with different configurations of the \ic{Mesh} class in TNL using the same data types for template parameters as in the previous subsection.
The configuration of MOAB~\cite{tautges:2004moab} was also the same as in the previous subsection.
-The results for polygonal and polyhedral meshes showed the same behaviour related to the data types as we observed on triangular and tetrahedral meshes: selecting the smallest possible types for the \ic{Real}, \ic{GlobalIndex}, and \ic{LocalIndex} template parameters generally leads to faster computations on both CPU and GPU compared to the larger types, and the performance of all benchmarks performed with the TNL library does not depend on other mesh configuration parameters, such as \ic{subentityStorage} or \ic{superentityStorage}, which can be used to disable the storage of unnecessary incidence matrices.
+The results for polygonal and polyhedral meshes showed the same behavior related to the data types as we observed on triangular and tetrahedral meshes: selecting the smallest possible types for the \ic{Real}, \ic{GlobalIndex}, and \ic{LocalIndex} template parameters generally leads to faster computations on both CPU and GPU compared to the larger types, and the performance of all benchmarks performed with the TNL library does not depend on other mesh configuration parameters, such as \ic{subentityStorage} or \ic{superentityStorage}, which can be used to disable the storage of unnecessary incidence matrices.
Hence, such configuration parameters influence the dynamic size required for the mesh representation as shown in \cref{tab:meshes:memory poly}, but not the performance of computations using the data structure.

Similarly to the simplex meshes, the cell measure calculations shown in \cref{tab:meshes:comptimes:measures poly} have different characteristics on polygonal and polyhedral meshes.