Commit 8246e5e0 authored by Jakub Klinkovský's avatar Jakub Klinkovský

copy-edit - addressed comments by PS

parent e133f39c
@@ -4,8 +4,9 @@ However, using the facilities for high performance computing efficiently is non-
Especially when designing algorithms for systems with GPU accelerators, which provide significantly more processing units as well as memory levels compared to traditional computational systems, specifics of the hardware architectures have to be considered.

Since the field of computational fluid dynamics includes many substantially different applications, the most important requirements imposed on the building blocks of numerical solvers are configurability and composability.
The former allows to adapt a piece of software, such as a data structure, for a specific application, and the latter allows to combine multiple existing pieces together to resolve a new use case.
Similarly, it is desirable to facilitate coupling even high-level tools and methods in order to provide more general solvers for multiphysics problems that arise in practice.
The former allows adapting a piece of software, such as a data structure, to a specific application.
The latter allows combining multiple existing pieces to address a new use case.
Similarly, it is desirable to facilitate the coupling of even high-level tools and methods \todo{what was meant by this?} in order to provide more general solvers for multiphysics problems that arise in practice.

The purpose of this thesis is the development of efficient data structures and parallel algorithms that can be used as fundamental ingredients in solvers based on, e.g., the lattice Boltzmann method or the mixed-hybrid finite element method.
Development of data structures for organizing structured as well as unstructured data in numerical simulations is necessary in order to match the requirements for efficient utilization of modern supercomputers.
@@ -27,8 +28,8 @@ Many data structures and algorithms for GPU accelerators are available in common
An important problem in linear algebra is the solution of sparse linear systems, which is covered later in this thesis in \cref{chapter:linear systems}.

Unlike linear algebra, other fields such as mesh computations are not as established and may not even have a fixed universal system of notation.
Hence, there is not one obvious way to do formulate problems and the development of reusable software components is inherently difficult.
Overall, there is a lack of software libraries providing a robust and multi-platform data structure for the representation of unstructured meshes.
Hence, there is no single obvious way to formulate problems, and the development of reusable software components is inherently difficult.
Overall, there is a lack of software libraries providing robust and multi-platform data structures for the representation of unstructured meshes.
Moreover, multiple types of meshes are used in practice, e.g., tetrahedral, hexahedral, and polyhedral.
While tetrahedral meshes are the easiest to use in numerical methods such as finite elements, hexahedral and especially polyhedral meshes are advantageous in complex cases with a high amount of data per cell.
Compared to other types, polyhedral meshes require a smaller number of cells to cover a given domain, and numerical methods such as finite volumes can be carefully designed to achieve the same level of accuracy as on tetrahedral meshes.
@@ -37,7 +38,7 @@ Numerous computational tools based on numerical methods such as finite volumes o
In particular, software projects such as deal.II~\cite{bangerth:2007deal.II}, DUNE~\cite{bastian:2006DUNE}, OpenFOAM~\cite{jasak:2007openfoam}, TOUGH2~\cite{pruess:1999TOUGH2}, MFiX~\cite{syamlal:1993}, ANSYS Fluent~\cite{ansys-fluent:2009} or COMSOL Multiphysics~\cite{COMSOLMultiphysics1998} are suitable for simulations in the field of computational fluid dynamics.
Additionally, the lattice Boltzmann method (LBM) has become popular for turbulent flow simulations \cite{kang2013,geier2015cumulant,kumar2018,peng2018,wittmann2013,zakirov2021}.
While many of the aforementioned projects are open-source and thus can be freely modified, it is difficult to incorporate novel approaches and methods into extensive software packages, especially for external users.
Hence, significant stream of innovation originates from small separate projects that gradually either evolve into larger projects, or get incorporated in existing software.
Hence, a significant stream of innovation originates from small separate projects that gradually either evolve into larger projects or get incorporated into existing software.
This thesis builds on the author's previous work \cite{klinkovsky:2017thesis}, an implementation of the mixed-hybrid finite element method for multiphase compositional flow in porous media, which can utilize NVIDIA GPU accelerators.
Additionally, the work extends the LBM implementation that is developed at the Department of Mathematics, FNSPE CTU in Prague.

@@ -56,7 +57,7 @@ To the best of our knowledge, they represent unique ideas that push forward the
        There are many competing libraries with similar features even in established fields such as linear algebra, where one might hope for the existence of a universal project collecting contributions from all over the world.
        However, competition inspires innovation and maintaining a separate project provides its developers with an easy way to try and share ideas with others.

        Note that this goal is never-ending, it lies in continuous effort rather than achieving a target.
Note that this goal is never-ending: it lies in continuous effort rather than reaching a milestone.
    \item
        To develop an efficient data structure for the representation of conforming unstructured meshes with full support of modern distributed computing platforms based on GPU accelerators.
    \item
@@ -97,7 +97,7 @@ Before the algorithm can be started, each MPI rank must configure the synchroniz
\end{itemize}

The synchronization procedure consists of the steps summarized in \cref{alg:distributed ndarray synchronization}.
Note that some of the steps in the algorithm are not always necessary.
Note that some steps in the algorithm may sometimes be unnecessary.
For example, the allocated buffers can be reused when the synchronizer is used repeatedly on arrays of the same size.

\begin{algorithm}[Distributed multidimensional array synchronization]
@@ -16,7 +16,8 @@ This thesis does not aim to repeat the summary, nevertheless, we need to highlig
Overall, GPU accelerators are advantageous for compute-bound as well as memory-bound data parallel applications.

To make the collection of individual accelerators and compute nodes scalable to the size of supercomputers, fast and scalable interconnections between the individual units are necessary.
Technologies such as NVLink and NVSwitch \cite{nvidia:2020A100,Li2020} allow for high-throughput and low-latency communication between GPUs in a single node and inter-node communication typically relies on switched fabric network interconnections.
Technologies such as NVLink and NVSwitch \cite{nvidia:2020A100,Li2020} allow for high-throughput and low-latency communication between GPUs in a single node.
Inter-node communication typically relies on switched fabric network interconnections.
State-of-the-art solutions for high-performance computing provide transfer speeds up to \SI{100}{Gbit\per\second} per link \cite{enwiki:InfiniBand}, latency around \SI{0.5}{\micro\second} \cite{enwiki:InfiniBand}, remote direct memory access (RDMA) capabilities to minimize CPU overhead \cite{Potluri2013,Li2020}, and acceleration of collective communication operations \cite{Graham2010,Schneider2013}.
Compute nodes can be organized in various network topologies such as fat tree or dragonfly with multiple levels of network switches, and links on higher levels can be aggregated in order to increase the overall bandwidth of the network.

@@ -169,8 +170,10 @@ When all work is finished, the threads are joined and the sequential processing
OpenMP provides many other clauses that may be used in the \texttt{\#pragma omp} directives in more complex cases.
It is obvious that OpenMP greatly simplifies parallelization in straightforward cases where the provided \texttt{\#pragma} directives are sufficiently expressive.
However, the reliance on compiler directives removes some level of control from the programmer, which may lead to workflow and performance problems.
Most notably, OpenMP directives may conflict with features introduced in modern \C++ standards, debugging synchronization bugs and race conditions may be more difficult than in other frameworks and code modifications due to \texttt{\#pragma} directives may prevent compiler optimizations such as automatic vectorization.
Also note that using OpenMP to manage threads may conflict with other multi-threading libraries such as POSIX threads or TBB.
Most notably, OpenMP directives may conflict with features introduced in modern \C++ standards \cite{iliev:OpenMP_vs_C++11}.
Also, due to \emph{ghost code} added by the compiler in place of \texttt{\#pragma} directives, debugging synchronization bugs, race conditions, and performance regressions may be more difficult than in other frameworks.
Furthermore, OpenMP does not guarantee interoperability with other multi-threading libraries.
Hence, combining OpenMP with TBB, POSIX threads, or another library results in undefined behavior, which leads to non-portable code at best.
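
For illustration, the \emph{axpy} operation ($y \leftarrow \alpha x + y$) used as a running example in the following sections can be parallelized with a single OpenMP directive. The following is a minimal sketch; the \ic{omp_axpy} name and the \ic{using} declarations mirroring the typedefs of the later examples are ours:
\begin{cppcode}
using index_t = int;     // mirrors the typedefs used in the later examples
using value_t = double;

// axpy: y = alpha * x + y
// The pragma distributes the loop iterations among the available threads.
void omp_axpy(index_t size, value_t alpha, value_t* x, value_t* y)
{
    #pragma omp parallel for
    for (index_t i = 0; i < size; i++)
        y[i] = alpha * x[i] + y[i];
}
\end{cppcode}
Note that without the \texttt{\#pragma} line the function is an ordinary sequential loop, which illustrates how little the source code changes in the straightforward cases mentioned above.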

OpenACC \cite{OpenACC} is a standard similar to OpenMP, which is designed to simplify parallel programming for heterogeneous systems consisting of multi-core processors and accelerators.
The programming style of OpenACC has basically the same advantages and disadvantages as OpenMP.
@@ -345,7 +348,7 @@ __global__ void kernel_axpy(index_t size, value_t alpha, value_t* x, value_t* y)
}
\end{cppcode}

Same as in the OpenCL example, we define an auxiliary function for convenient error checking:
As in the OpenCL example, we define an auxiliary function for convenient error checking:
\begin{cppcode}
void checkErr(cudaError_t err, const char* name)
{
@@ -376,7 +379,7 @@ void parallel_axpy(index_t size, value_t alpha, value_t* x, value_t* y)
    err = cudaMemcpy(device_y, y, sizeof(value_t) * size, cudaMemcpyHostToDevice);
    checkErr(err, "cudaMemcpy");
\end{cppcode}
When the input data is prepared in the device memory, the kernel can be launched with thread block and grid sizes appropriate for our range of input dataset.
When the input data is prepared in the device memory, the kernel can be launched with thread block and grid sizes appropriate for the size of the input dataset.
Without going into the details of the thread hierarchy in the CUDA programming model, the following values can be used for the \ic{axpy} operation with 1D data:
\begin{cppcode*}{autogobble=false, linenos, firstnumber=18}
    dim3 blockSize, gridSize;
@@ -481,7 +484,7 @@ Many books are dedicated to explaining the basics of MPI programming as well as
Here we describe only the most important concept, \emph{communicators}.
A communicator is an object that provides a context for communication among a group of MPI processes.
Each communicator is characterized by its ID/tag, which differentiates messages from different communicators, and a group of processes that provides a rank-naming mechanism based on numeric IDs used for addressing messages.
The most common communicator is denoted by the constant \ic{MPI_COMM_WORLD} and comprises all ranks participating in the invokation of an MPI program.
The most common communicator is denoted by the constant \ic{MPI_COMM_WORLD} and comprises all ranks participating in the invocation of an MPI program.
Communicators are represented by the type \ic{MPI_Comm} and MPI provides several functions to create a custom communicator, for example:
\begin{itemize}
    \item
@@ -682,7 +685,7 @@ When the data is prepared, the execution policy can be defined and the \emph{axp
        }
    );
\end{cppcode*}
Finally, we must remember to copy the result back to the host memory in case the execution happened in a device space:
Finally, we must remember to copy the result back to the host memory in case the execution happened in the device space:
\begin{cppcode*}{autogobble=false, linenos, firstnumber=22}
    // copy the y array from device memory to host memory
    Kokkos::deep_copy(y_host, y);
@@ -717,7 +720,7 @@ TNL does not follow a typical design for heterogeneous platforms where most comp
Instead, TNL is designed for numerical solvers where most computations are performed either entirely on CPU or a GPU accelerator.
This is achieved by a \ic{Device} template parameter that allows switching between computations on a CPU, a GPU, and potentially other platforms.
Of course, it is still necessary to rely on the CPU for sequential tasks such as initialization or data output, but the majority of time-consuming computations can be parametrized by \ic{Device}.
This approach requires all data used in the computations to be allocated in the memory of the accelerator in order to minimize data movement between the system memory and accelerators, otherwise computations would be limited by communication over the PCI-E bus.
This approach requires all data used in the computations to be allocated in the memory of the accelerator in order to minimize data movement between the system memory and accelerators; otherwise, computations would be limited by communication over the PCIe bus.
Data structures implemented in TNL provide an easy way to manage data allocated in different address spaces.
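
The \ic{Device}-parametrized design described above can be sketched schematically as follows. This is an illustration of the pattern only, not the actual TNL interface; the \ic{Host} and \ic{Cuda} tags and the \ic{MemoryOperations} helper are simplified for this example:
\begin{cppcode}
// Each device is represented by an empty tag type.
struct Host {};
struct Cuda {};

// The primary template is only declared; each device provides
// a specialization implementing the operation for its memory space.
template< typename Device >
struct MemoryOperations;

template<>
struct MemoryOperations< Host >
{
    // sequential (or OpenMP-parallelized) implementation on host memory
    static void fill( double* data, double value, int size )
    {
        for (int i = 0; i < size; i++)
            data[i] = value;
    }
};

template<>
struct MemoryOperations< Cuda >
{
    // a CUDA kernel launch operating on device memory would go here
    static void fill( double* data, double value, int size );
};

// Algorithms are parametrized by Device, so the implementation
// appropriate for the target platform is selected at compile time.
template< typename Device >
void reset( double* data, int size )
{
    MemoryOperations< Device >::fill( data, 0.0, size );
}
\end{cppcode}
A call such as \ic{reset< Cuda >(data, size)} then dispatches to the device implementation without any changes in the calling code, which is the essence of switching platforms via the \ic{Device} template parameter.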

Finally, an important aspect of the TNL design is that it simplifies the development of portable parallel algorithms, a goal that is shared with other high-level libraries such as Thrust or Kokkos.
@@ -780,7 +783,7 @@ More advanced algorithms include sparse matrix--vector multiplication and other
Note that the list is not exhaustive and other algorithms may be implemented in the future.

Finally, we show an example implementing the \emph{axpy} operation in TNL.
Same as in the preceding sections, the input data is allocated in the host memory.
As in the preceding sections, the input data is allocated in the host memory.
A parallel version for the \ic{Host} device can be implemented using the \ic{parallelFor} function and a simple lambda expression as follows:
\begin{cppcode}
void host_axpy(index_t size, value_t alpha, value_t* x, value_t* y)
@@ -33,7 +33,7 @@ There are multiple reasons why we pursue this approach:
        For some problems, it may provide higher accuracy and robustness compared to standard FEM or FVM approaches \cite{radu2011,yu2020}.
    \item
        Both LBM and MHFEM individually can be efficiently parallelized and implemented for modern high-performance architectures, including GPU accelerators.
        All computations in the coupled LBM-MHFEM solver can be executed entirely on a GPU accelerator in order to utilize its computational power and avoid the hardware limitations caused by slow communication between the GPU and CPU over the PCI-E bus.
        All computations in the coupled LBM-MHFEM solver can be executed entirely on a GPU accelerator in order to utilize its computational power and avoid the hardware limitations caused by slow communication between the GPU and CPU over the PCIe bus.
    \item
        The coupled solver can utilize vast computational resources available on typical supercomputers by decomposing the domain and dividing the computation between multiple workers (GPUs) which communicate over the Message Passing Interface (MPI) \cite{mpi:3.1}.
\end{itemize}
@@ -614,6 +614,13 @@
  isbn      = {978-3-642-16478-1},
}

@Online{iliev:OpenMP_vs_C++11,
    author = {Hristo Iliev},
    title = {Answer to {Can I safely use OpenMP with C++11?} on {Stack Overflow}},
    url = {https://stackoverflow.com/a/13839719},
    year = {2012},
}

@Comment{jabref-meta: databaseType:bibtex;}

@Comment{jabref-meta: protectedFlag:true;}