Commit 6d8aeb32 authored by Jakub Klinkovský

copy-edit - addressed comments by TO

parent b1d52e01
@@ -33,7 +33,7 @@ The induced dimension reduction methods, denoted as IDR($s$), are a new family o
Although they were reported to outperform traditional methods such as BiCGstab for several problems, they are still much less popular.
In general, the research focuses on the development of more efficient preconditioners rather than new iterative methods themselves.

\inline{stopping criteria?}
%\inline{stopping criteria?}

\section{Preconditioning techniques}
\label{sec:preconditioning}
@@ -61,10 +61,13 @@ is also possible, where the preconditioner is $\matrix M = \matrix M_1 \matrix M

There is no consensus on which kind of preconditioning is the best and how to choose the preconditioning matrix $\matrix M$.
The choices typically depend on the iterative method used as well as on specific matrix properties of the given problem; however, several general-purpose preconditioners that do not require problem-specific input have been developed.
In general, the minimal requirements imposed on the preconditioner $\matrix M$ are that it must be non-singular and it should approximate $\matrix A$ in some sense, such that $\matrix M^{-1} \matrix A$ and $\matrix A \matrix M^{-1}$ are close to the identity matrix $\matrix I$.
Note that the preconditioned matrices $\matrix M^{-1} \matrix A$ and $\matrix A \matrix M^{-1}$ are not assembled explicitly as that would be too expensive and sparsity would be lost.
Firstly, the minimal requirement imposed on the preconditioner $\matrix M$ is that it must be non-singular.
It may also be required to preserve the symmetry of the original matrix $\matrix A$ in the preconditioned system.
One way to achieve this is to use the split-preconditioning approach with $\matrix M = \matrix M_1 \matrix M_2$, where $\matrix M_1$ and $\matrix M_2$ are non-singular and $\matrix M_1 = \matrix M_2^T$.
Additionally, the preconditioner should approximate $\matrix A$ in some sense, such that $\matrix M^{-1} \matrix A$ is close to the identity matrix $\matrix I$.
Note that the matrix of the preconditioned system (i.e., $\matrix M^{-1} \matrix A$, $\matrix A \matrix M^{-1}$, or $\matrix M_1^{-1} \matrix A \matrix M_2^{-1}$) is not assembled explicitly as that would be too expensive and sparsity would be lost.
Instead, additional requirements are that the operation $\matrix M^{-1} \vec v$ should be cheap to evaluate for an arbitrary vector $\vec v$ and that it should offer opportunities for an efficient parallel implementation.
However, in practice, the last three requirements often go against each other, which poses a major difficulty to the development of high performance preconditioners.
However, in practice, the last three requirements often go against each other: it is difficult to develop a preconditioner that is \emph{effective} (i.e., approximates the matrix $\matrix A$ well), \emph{cheap} (i.e., can be applied easily), and \emph{parallel} all at the same time.
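This tension can be illustrated with the simplest example, the Jacobi preconditioner $\matrix M = \operatorname{diag}(\matrix A)$. The following sketch is purely illustrative (plain Python, not part of the thesis or TNL): applying $\matrix M^{-1}$ is a single elementwise division, hence cheap and trivially parallel, but $\matrix M$ approximates $\matrix A$ only weakly.

```python
# Jacobi preconditioner sketch: M = diag(A). Applying M^{-1} to a vector
# is one elementwise division -- cheap and trivially parallel -- but M
# approximates A only weakly, so its effectiveness is limited.
def jacobi_apply(A, v):
    """Return M^{-1} v for M = diag(A), with A stored as a list of rows."""
    return [v[i] / A[i][i] for i in range(len(v))]

A = [[4.0, 1.0], [1.0, 3.0]]
v = [8.0, 9.0]
print(jacobi_apply(A, v))  # [2.0, 3.0]
```

Note that, consistently with the remark above, the preconditioned matrix $\matrix M^{-1} \matrix A$ is never formed; the iterative method only needs the action of $\matrix M^{-1}$ on a vector.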

\subsection{State of the art}

The content of this section is based on the paper~\cite{klinkovsky:2022meshes} that presents a data structure for \emph{conforming unstructured homogeneous} meshes implemented in TNL~\cite{oberhuber:2021tnl}.
The initial implementation was designed by Vítězslav Žabka and was later generalized and improved by the author into the presented form.
Since the publication of the paper in the journal, we have extended the data structure for \emph{polygonal} and \emph{polyhedral} meshes.
The implementation of this extension in TNL was conducted by Ján Bobot as part of his Master's degree project~\cite{bobot:2022} under the supervision of the author.
The extensions for polygonal and polyhedral meshes are incorporated into the text in this section.
@@ -30,7 +31,7 @@ Although the aforementioned libraries are very general in the sense that they fa
For example, some libraries focus on adaptive mesh refinement of conforming simplex meshes, while others focus on the handling of non-conforming and hierarchical meshes.
From the performance point of view, the development has focused on parallel computations on distributed memory systems and scaling to thousands of nodes that are available on current supercomputers.
Another important feature is the support for efficient computations on the GPU accelerators which have already been used successfully for numerical simulations on unstructured meshes \cite{balogh:2017comparison,sulyok:2019locality,zayer2017gpu,giuliani2017face,qin:2019accelerating}.
GMlib~\cite{marechal:2020gmlib} provides unstructured mesh data structures for programmers using the OpenCL language, but in general native CUDA programming for Nvidia GPUs is preferred in high-performance computing, since high-level parallelisation approaches such as OpenMP or OpenACC cannot compete with it in terms of performance~\cite{balogh:2017comparison}.
GMlib~\cite{marechal:2020gmlib} provides unstructured mesh data structures for programmers using the OpenCL language, but native programming models such as CUDA for Nvidia GPUs are generally preferred in high-performance computing due to higher flexibility and performance~\cite{balogh:2017comparison}.
However, most work has focused on the algorithmic aspects of GPU computing rather than providing a multi-purpose data structure and the code of these works is not easily reusable, either because it is not open-source, or because it is too tightly coupled with a specific application.
Another problem is that extending existing software for GPUs while preserving efficiency is practically impossible due to different memory layout requirements.

@@ -269,7 +270,7 @@ A $D$-dimensional mesh composed of cells with static topologies can be fully spe
In the case of a simplex mesh, the lists of subvertices can be unordered and the relations between cells and their subvertices can be fully represented by the incidence matrix $I_{D,0}$.
In the case of a non-simplex mesh, the lists of subvertices have to be ordered, and thus the binary incidence matrix $I_{D,0}$ represents only the connectivity, but not the topological information.
The authors of \cite{zayer2017gpu} define a non-binary \emph{mesh matrix} which has the same pattern as the incidence matrix and the non-zero entries specify local indices of the corresponding subentities.
However, the local ordering of subentities can be encoded in the representation of the corresponding binary incidence matrix.
However, the local ordering of subentities can be encoded in the representation of the corresponding binary incidence matrix, avoiding the need to explicitly store the non-zero entries.
When a sparse matrix format is used for its representation, the non-zero entries in each row can be placed according to the \emph{local ordering} rather than the usual ordering given by the global subentity indices.
Such an \emph{enhanced} incidence matrix $I_{D,0}$ fully represents the connectivity as well as topological information about the cells.
The relations between other mesh entities can be deduced from the incidence matrix $I_{D,0}$ based on the topology of a specific cell.
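The idea of the enhanced incidence matrix can be sketched as follows (a hypothetical CSR-like layout for illustration, not the actual TNL data structure): within each row, the column indices are stored in the cell's local subvertex order instead of the sorted order, so no explicit non-zero values need to be stored.

```python
# CSR-like enhanced incidence matrix I_{D,0} for two quadrilaterals
# sharing the edge (1, 4): cell 0 = (0, 1, 4, 3), cell 1 = (1, 2, 5, 4).
# Within each row, the column (vertex) indices are kept in the cell's
# local subvertex order, which encodes the topology implicitly.
row_ptr = [0, 4, 8]
col_idx = [0, 1, 4, 3,
           1, 2, 5, 4]

def subvertex(cell, local_index):
    """Global vertex index of the subvertex with the given local index."""
    return col_idx[row_ptr[cell] + local_index]

print(subvertex(0, 2), subvertex(1, 3))  # 4 4
```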
@@ -481,7 +482,7 @@ Unfortunately, the term \emph{orientation} is commonly used in two different con
\begin{enumerate}
    \item
        In case of the finite volume method, one is interested in the direction of the normal vectors uniquely assigned to each face of the mesh.
        Each face can have up to two adjacent cells and its normal vector can be oriented two ways: either outward or inward with respect to the first adjacent cell (i.e., inward or outward, respectively, with respect to the other cell).
        Each face can have up to two adjacent cells and its normal vector can be oriented in two ways: either outward or inward with respect to the first adjacent cell (i.e., inward or outward, respectively, with respect to the other cell).
        %An explicit representation of this information requires storing one index per face, e.g. the global index of the adjacent cell from which the normal vector on the face points outwards.
        This information is also sufficient for some low-order finite element discretizations, e.g. the lowest order Raviart--Thomas elements that we used in~\cite{fucik:2019NumDwarf}.

@@ -506,7 +507,7 @@ Unfortunately, the term \emph{orientation} is commonly used in two different con
        This case is more general than the former since entities other than faces are considered (e.g. edges in 3D), and even the direction of normal vectors on faces can be based on a consistent local ordering of face subvertices.

        However, the current implementation of the presented data structure does not provide any way to deal with general mesh entity orientations according to this context.
        Hence, implementing computational methods that need this information, such as high-order finite element methods, might require using additional data structures to store the necessary mesh entity orientations, or following a different approach that would avert the need for additional storage.
        Hence, implementing computational methods that need this information, such as high-order finite element methods, might require using additional data structures to store the necessary mesh entity orientations, or following a different approach that would avoid the need for additional storage.
        Various strategies for dealing with this general case can be found in~\cite{agelek2017orienting}.
\end{enumerate}
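The first context admits a particularly compact encoding. The following sketch is hypothetical (it follows the idea of storing one adjacent-cell index per face, not the actual implementation):

```python
# Hypothetical encoding of face-normal orientation: per face, store the
# global index of the adjacent cell from which the normal points outward;
# the other adjacent cell then sees the same normal with a flipped sign.
def normal_sign(face_owner, cell):
    """Return +1 if the face normal points outward from `cell`, -1 otherwise."""
    return 1 if face_owner == cell else -1

owner_of_face = [0, 0, 1]  # made-up owner cells for three faces
print(normal_sign(owner_of_face[1], 0))  # 1
print(normal_sign(owner_of_face[1], 1))  # -1
```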

@@ -600,7 +601,6 @@ The sparse matrix formats described in \subsubsecref{sec:meshes:internal data st
For example, if we access the incidence matrix $I_{2,1}$ to obtain the indices of edges incident to a two-dimensional mesh entity, and then use the obtained indices to query the incidence matrix $I_{1,0}$ to obtain the indices of edge subvertices, the efficiency of accessing the latter incidence matrix cannot be guaranteed and depends on the locality of obtained edge indices.
A common optimization strategy for similar scenarios is to improve data locality by applying a suitable reordering algorithm~\cite{sulyok:2019locality}.
Common approaches typically involve an \emph{in-order traversal} of a quadtree or octree which considers the geometric locality of mesh entities, or an application of a suitable space-filling curve such as the Hilbert curve~\cite{hilbert1891}, or a graph ordering such as the Cuthill--McKee algorithm~\cite{cuthill:1969reducing} which is known to reduce the bandwidth of the graph's adjacency matrix, or a graph colouring technique such as~\cite{giuliani2017face}.
\todo{Add some figures for mesh ordering algorithms?}
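As a sketch of one of these approaches, the basic Cuthill--McKee ordering can be computed by a breadth-first traversal that starts from a vertex of minimal degree and visits neighbours in order of increasing degree. The following is a generic illustration on an adjacency-list graph (a simplified stand-in for the mesh connectivity graph); reversing the result yields the reverse Cuthill--McKee ordering.

```python
from collections import deque

# Basic Cuthill--McKee ordering: BFS from a minimal-degree vertex,
# visiting neighbours in order of increasing degree. The outer loop
# handles graphs with several connected components.
def cuthill_mckee(adj):
    n = len(adj)
    order, seen = [], [False] * n
    for start in sorted(range(n), key=lambda v: len(adj[v])):
        if seen[start]:
            continue
        seen[start] = True
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in sorted(adj[v], key=lambda u: len(adj[u])):
                if not seen[w]:
                    seen[w] = True
                    queue.append(w)
    return order

# Path graph 0-1-2-3: the traversal starts at the degree-1 vertex 0.
print(cuthill_mckee({0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}))  # [0, 1, 2, 3]
```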

The \ic{Mesh} class template provides an interface for applying any custom ordering of mesh entities.
The $d$-dimensional entities of a mesh can be reordered by calling the templated member function \ic{reorderEntities} and specifying $d$ as the template parameter:
@@ -700,9 +700,10 @@ The synchronizer can be used to synchronize data arrays allocated anywhere, i.e.
Note that thanks to the so-called CUDA-aware MPI \cite{kraus:cuda-aware-mpi}, we do not have to deal with handling custom buffers when receiving or sending data allocated on the GPU.

The synchronizer needs to be initialized before its use.
The initialization phase involves the following steps:
Let $P$ denote the number of MPI ranks among which the mesh is distributed.
Then the initialization phase involves the following steps:
\begin{enumerate}
    \item An unsymmetric $P \times P$, where $P$ stands for the number of MPI ranks, communication matrix $G$ is created such that $G_{i,j}$ represents the
    \item An unsymmetric $P \times P$ communication matrix $G$ is created such that $G_{i,j}$ represents the
    number of $d$-dimensional ghost entities on rank $i$ that are owned by rank $j$.
    \item Each rank collects the indices of the first entity owned by each neighbour and stores them in an array of size $P$.
    \item Each rank collects the indices of local entities that are ghosts on a different subdomain.
@@ -844,7 +845,8 @@ The code of all benchmarks described in this subsection is available as open-sou

To compare the results with the hardware peak memory throughput, we calculate an effective bandwidth $EBW$ for each algorithm.
The $EBW$ is calculated by dividing the \emph{useful data size} for the given computation by the achieved computational time.
For both benchmark algorithms, the useful data size consists of the size of all vertex coordinates, the size of all indices read by the algorithm, and the size of output data written by the algorithm to the memory.\ask{Should I note that this is different from \cite{klinkovsky:2022meshes}?}
For both benchmark algorithms, the useful data size consists of the size of all vertex coordinates, the size of all indices read by the algorithm, and the size of output data written by the algorithm to the memory.
Note that this is a more accurate approach compared to \cite{klinkovsky:2022meshes} where the output data size is not considered in $EBW$ calculations.
The number of write operations equals $\#\mathcal C$ in the first benchmark and approximately $2 \cdot \#\mathcal F$ in the second benchmark.
The number of indices read by each algorithm depends on the mesh type.
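The $EBW$ calculation itself reduces to a single division. The following sketch uses made-up sizes for illustration; the values are not measurements from the benchmarks.

```python
# Effective bandwidth: useful data size divided by the achieved time.
# The useful data size counts the vertex coordinates, the indices read
# by the algorithm, and the output data written to memory.
def effective_bandwidth(bytes_read, bytes_written, seconds):
    """Return EBW in GB/s."""
    return (bytes_read + bytes_written) / seconds / 1e9

# E.g. 3.2 GB read (coordinates + indices), 0.8 GB written, in 10 ms:
print(effective_bandwidth(3.2e9, 0.8e9, 1e-2))  # 400.0
```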

\inline{I would leave out the details; they are described in the Master's thesis and I did not do much with this during the doctorate... (only the distributed version -- which, however, is very useful for LBM, together with the synchronizer)}
\inline{Comparison with other libraries}

Many algorithms in scientific computing work with coefficients indexed by three, four or even more indices and multidimensional arrays are a natural data structure for representing such values in the computer memory.
Many algorithms in scientific computing work with coefficients indexed by three, four or even more indices.
Multidimensional arrays are therefore a natural data structure for organizing and storing such values in the computer memory.
Since the \C++ language supports only one-dimensional arrays natively, multidimensional arrays have to be implemented explicitly (e.g., in a library) based on a mapping of multidimensional data to an internal one-dimensional array.

An interesting problem is how to choose the mapping from the multidimensional index space into the one-dimensional index space.
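The two most common choices can be sketched as follows (a generic illustration, not the TNL implementation): the row-major (C) order makes the last index vary fastest, while the column-major (Fortran) order makes the first index vary fastest.

```python
# Two common mappings of a 3D index (i, j, k) in an array of shape
# (N1, N2, N3) to a one-dimensional offset.
def row_major(i, j, k, N1, N2, N3):
    return (i * N2 + j) * N3 + k   # last index varies fastest (C order)

def column_major(i, j, k, N1, N2, N3):
    return i + N1 * (j + N2 * k)   # first index varies fastest (Fortran order)

print(row_major(1, 2, 3, 4, 5, 6))     # (1*5 + 2)*6 + 3 = 45
print(column_major(1, 2, 3, 4, 5, 6))  # 1 + 4*(2 + 5*3) = 69
```

Both mappings are bijective on the index space; they differ only in which index corresponds to contiguous memory, which is what drives the performance considerations discussed here.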