In this section, we describe Template Numerical Library, an open-source project that is developed mainly by the author and his supervisor.
The content is based on the paper \cite{oberhuber:2021tnl} with updates due to recent changes in the library.
The text provides a high-level view of the project; a detailed description of the features and usage examples can be found in its documentation \cite{TNL:documentation}.
\subsection{Introduction}
Template Numerical Library \cite{oberhuber:2021tnl} is a project that aims to simplify the development of high-performance numerical solvers and methods, especially in the field of computational fluid dynamics.
It is an open-source project developed in \C++ and relies on modern features of the language, currently the \CPPseventeen standard.
In order to simplify installation and inclusion in other projects, TNL is designed as a header-only library.
The name refers to \C++ template meta-programming techniques that are used throughout the library and often preferred over corresponding run-time approaches involving virtual functions that have limited support in the CUDA framework \cite{nvidia:cuda}.
In general, this approach also leads to a more efficient code \cite{coplien:crtp,nidito:templates-vs-virtual} at the cost of increased work at compile-time.
Moreover, it encourages detection of programming errors at compile-time.
However, considering the development of modern \C++ standards, using the word \enquote{template} as part of the name does not differentiate it from other libraries, since any successful \C++ library must use templates.
Of the high-level libraries described in the previous section, Kokkos is the most general parallel programming and performance portability library.
TNL strives for similar generality and, in addition, targets a broader scope of problems.
TNL is not only a parallel programming library; it also provides algorithms and data structures for dense and sparse linear algebra, structured grids and unstructured meshes, and other building blocks for advanced numerical solvers.
Furthermore, TNL has wrappers to simplify using low-level MPI functions in \C++ and provides distributed data structures with synchronizers based on MPI.
TNL does not follow a typical design for heterogeneous platforms where most computations are performed by the CPU and specific tasks are offloaded to an accelerator, or where the work is divided between accelerators and the CPU.
Instead, TNL is designed for numerical solvers where most computations are performed either entirely on the CPU or entirely on a GPU accelerator.
This is achieved by a \ic{Device} template parameter that allows switching between computations on the CPU, a GPU, and potentially other platforms.
Of course, it is still necessary to rely on the CPU for sequential tasks such as initialization or data output, but the majority of time-consuming computations can be parametrized by \ic{Device}.
This approach requires all data used in the computations to be allocated in the memory of the accelerator in order to minimize data movement between the system memory and the accelerator; otherwise, computations would be limited by communication over the PCI-E bus.
Data structures implemented in TNL provide an easy way to manage data allocated in different address spaces.
Finally, an important aspect of the TNL design is that it simplifies the development of portable parallel algorithms, a goal that is shared with other high-level libraries such as Thrust or Kokkos.
This means that algorithms or whole numerical solvers can be developed and debugged on CPUs, for which many development tools are available, and then switched to a different platform by changing a single template parameter.
In our experience, many algorithms developed this way work on GPUs with no modifications, or with only minor modifications accounting for the different behavior of the parallel platforms.
\subsection{Parallel programming components}
TNL is built on a system of backends for modern parallel architectures, such as multi-core CPUs and GPU accelerators, over which it provides a high-level interface for simple and portable programming of these platforms.
The current backends are based on OpenMP \cite{OpenMP} for CPU parallelization and CUDA \cite{nvidia:cuda} for GPU parallelization.
Adding support for backends based on HIP \cite{amd:hip} and SYCL \cite{khronos:sycl} is planned for the future.
Similar to Kokkos, the parallel programming interface in TNL is based on abstractions for various memory and execution operations.
The concept of \emph{allocators} takes care of memory allocation, the fundamental memory operation; it corresponds roughly to the \emph{memory space} concept in Kokkos.
TNL provides the following allocators in the \ic{TNL::Allocators} namespace with an interface compatible with STL: \ic{Host} (alias for \ic{std::allocator}), \ic{Cuda} (allocations in the global memory of a GPU using \ic{cudaMalloc}), \ic{CudaHost} (allocations in the host memory using \ic{cudaMallocHost}), and \ic{CudaManaged} (allocations in the unified memory space using \ic{cudaMallocManaged}).
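For illustration, an allocator can be used directly in the STL fashion; the following sketch allocates and releases a buffer in the global memory of a GPU:
\begin{cppcode}
TNL::Allocators::Cuda<double> allocator;
// allocate space for 1000 elements in the GPU global memory (via cudaMalloc)
double* data = allocator.allocate(1000);
// ... pass the buffer to device code ...
allocator.deallocate(data, 1000);
\end{cppcode}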
Next, TNL provides an interface for additional memory operations, such as data copies (including copies between mismatched address spaces), comparisons, and construction and destruction of elements (objects) in allocated memory.
The main abstraction for code execution in TNL is realized by the \ic{Device} template parameter of various functions and classes.
A \emph{device} determines where and how the code is executed.
TNL currently provides the following devices in the \ic{TNL::Devices} namespace: \ic{Sequential} (sequential execution either on CPU or within a GPU thread), \ic{Host} (parallel execution on CPU via OpenMP), and \ic{Cuda} (parallel execution on a GPU).
Note that the device concept is not fixed and the implementation of additional backends may require significant refactoring or even introduction of more refined concepts.
Furthermore, TNL provides fundamental parallel algorithms for common operations: \ic{parallelFor}, \ic{reduce}, \ic{scan}, and \ic{sort}.
These correspond to the same functionality present in other high-level libraries and some parallel programming frameworks, such as Kokkos, SYCL, and TBB.
The algorithms in TNL use indices and lambda expressions for data access, rather than iterators as in Thrust and the STL.
Some algorithms like \ic{parallelFor} expose an interface for fine-tuning execution details, such as specifying a CUDA stream or dynamic shared memory size for CUDA kernels.
The parameters are specified via an instance of the \ic{LaunchConfiguration} structure, which roughly corresponds to the \emph{execution policy} concept in Kokkos.
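For illustration, a parallel sum can be sketched with the \ic{reduce} algorithm and a lambda expression producing the values to be reduced (a sketch assuming the overload taking an index range, a fetch function, and the \ic{TNL::Plus} function object):
\begin{cppcode}
// sum the integers 0, 1, ..., 999 in parallel on a GPU
auto fetch = [] __cuda_callable__ (int i) -> double { return i; };
double sum = TNL::Algorithms::reduce<TNL::Devices::Cuda>(0, 1000, fetch, TNL::Plus{});
// sum == 499500
\end{cppcode}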
Finally, TNL provides components for distributed computing based on MPI \cite{mpi:3.1}.
These are based on low-level wrappers for the MPI functions with a C interface, which simplify their usage in \C++ and provide a basic portability layer (i.e., the wrappers provide common behavior for builds without MPI, but switching to a different distributed-computing framework is not supported).
Higher-level utilities are provided for common operations, e.g., \ic{reduce} function template for scalar values, \ic{send} and \ic{recv} function templates for the \ic{Array} data structure, etc.
At the highest level, distributed data structures are implemented in order to provide an object-oriented view of distributed computing.
\subsection{Data structures}
TNL provides many common dynamic data structures, such as \ic{Array} and \ic{Vector}, \ic{DenseMatrix} and \ic{SparseMatrix}, or \ic{NDArray} (see \cref{sec:ndarray} for details).
Additional variants are implemented for each of the aforementioned data structures:
\begin{itemize}
\item\emph{Views}:
Each dynamic data structure has an associated view data structure, which provides an interface with non-owning and shallow-copy semantics.
As a consequence, views cannot be resized, can be rebound to other data (including external data structures), can be passed by value to device functions such as CUDA kernels, and can be captured by value in lambda expressions.
Hence, they simplify data access in code for GPU accelerators.
\item\emph{Distributed data structures}:
For each aforementioned data structure there is a distributed variant (e.g., \ic{DistributedArray} etc.) that implements an interface for object-oriented distributed computing via MPI.
Each distributed data structure uses the original dynamic data structure to manage the local data and provides access to global and local views of the data.
\end{itemize}
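For illustration, the following sketch binds a view to an array allocated on a GPU and captures it by value in a device lambda (assuming the \ic{getView} accessor and the \ic{parallelFor} algorithm from the previous subsection; the \ic{__cuda_callable__} macro marks the lambda as callable from device code):
\begin{cppcode}
TNL::Containers::Array<double, TNL::Devices::Cuda> a(100);
auto a_view = a.getView();
// the view is copied into the lambda, the data is shared with the array
TNL::Algorithms::parallelFor<TNL::Devices::Cuda>(0, a.getSize(),
   [=] __cuda_callable__ (int i) mutable { a_view[i] = i; });
\end{cppcode}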
Besides dynamic data structures, TNL provides \ic{StaticArray}, \ic{StaticVector} and \ic{StaticNDArray} that manage data in fixed-size objects created in the \emph{stack segment} of memory.
They are usable in device code and even constant expressions (e.g., \ic{constexpr} functions).
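For illustration, a \ic{StaticVector} can be evaluated in a constant expression (a sketch, assuming \ic{constexpr}-qualified construction and element access):
\begin{cppcode}
constexpr double sum3()
{
   TNL::Containers::StaticVector<3, double> v{1, 2, 3};
   return v[0] + v[1] + v[2];
}
static_assert(sum3() == 6, "evaluated at compile time");
\end{cppcode}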
Additionally, TNL provides advanced data structures specific to numerical applications, such as grids and meshes.
The data structure for unstructured meshes is described in detail in \cref{sec:meshes}.
Similarly to the \emph{memory layout} concept in Kokkos, most of the TNL data structures can be configured with a template parameter to adjust the layout of the data in memory (e.g., row-major or column-major matrices).
TNL currently does not have an analogy to the \emph{memory traits} concept in Kokkos.
\subsection{Algorithms}
TNL provides high-level functions for common operations based on the fundamental parallel algorithms, for example, \ic{compare}, \ic{contains}, \ic{containsOnlyValue}, or vector operations based on expression templates \cite{veldhuizen1995,Matsuzaki2010}.
More advanced algorithms include sparse matrix--vector multiplication and other matrix operations, iterative solvers for systems of linear algebraic equations, preconditioners, and solvers for systems of ordinary differential equations.
Note that the list is not exhaustive and other algorithms may be implemented in the future.
Finally, we show an example implementing the \emph{axpy} operation in TNL.
As in the preceding sections, the input data are allocated in the host memory.
A parallel version for the \ic{Host} device can be implemented using the \ic{parallelFor} function and a simple lambda expression as follows:
\begin{cppcode}
void host_axpy(int size, value_t alpha, const value_t* x_raw, value_t* y_raw)
{
   auto kernel = [=] (int i)
   {
      y_raw[i] += alpha * x_raw[i];
   };
   TNL::Algorithms::parallelFor<TNL::Devices::Host>(0, size, kernel);
}
\end{cppcode}
Alternatively, the same operation can be implemented using \ic{VectorView} objects and expression templates:
\begin{cppcode}
void host_axpy(int size, value_t alpha, value_t* x_raw, value_t* y_raw)
{
   using host_view = TNL::Containers::VectorView<value_t, TNL::Devices::Host>;
   host_view x(x_raw, size);
   host_view y(y_raw, size);
   // compute the vector expression
   y = alpha * x + y;
}
\end{cppcode}
Both variants of the \ic{host_axpy} function can be adapted for computations on a GPU accelerator.
First, we create \ic{ArrayView} objects associated with the \ic{Host} device for the input arrays, then allocate \ic{Array} objects associated with the \ic{Cuda} device, and copy the data from the host memory to the device memory:
\begin{cppcode}
using device_t = TNL::Devices::Cuda;
using host_view = TNL::Containers::ArrayView<value_t, TNL::Devices::Host>;
host_view x_host(x_raw, size);
host_view y_host(y_raw, size);
// allocate arrays in the device memory and initialize them with the host data
using array_t = TNL::Containers::Array<value_t, device_t>;
array_t x_device, y_device;
x_device = x_host;
y_device = y_host;
\end{cppcode}
In order to access the data in \ic{x_device} and \ic{y_device} from a lambda expression in the \ic{parallelFor} function, we must create views for the data, which can be captured by value in the lambda expression.
Note that the lambda expression must be annotated with the \ic{__cuda_callable__} macro, which marks functions that should be executable on the CPU as well as on the GPU.
Additionally, the lambda expression must have the \ic{mutable} specifier so that the variables captured by value are modifiable.
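Putting the pieces together, the device variant of the \emph{axpy} operation can be sketched as follows (assuming the \ic{getConstView} and \ic{getView} accessors of TNL arrays):
\begin{cppcode}
auto x_view = x_device.getConstView();
auto y_view = y_device.getView();
auto kernel = [=] __cuda_callable__ (int i) mutable
{
   y_view[i] += alpha * x_view[i];
};
TNL::Algorithms::parallelFor<device_t>(0, size, kernel);
// copy the result back to the host memory
y_host = y_device;
\end{cppcode}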
As is evident from the preceding description, TNL is an actively developed project with many features in development.
The most fundamental features planned for the future are backends based on modern frameworks such as HIP \cite{amd:hip} and SYCL \cite{khronos:sycl}.
Their development might also require refactoring of existing facilities, namely the concept behind the \ic{Device} template parameter should be revisited with respect to the concepts of \emph{execution policies} that are used in other high-level libraries such as Kokkos \cite{Kokkos:ecosystem,Kokkos:programming}, Thrust/STL \cite{thrust:website}, or Ginkgo \cite{ginkgo-toms-2022}.
Furthermore, aspects such as documentation, unit tests, and benchmarks are continuously being improved.
The open-source TNL library \cite{oberhuber:2021tnl} simplifies parallelization and distributed computing on GPU clusters.
TNL natively supports and provides a unified high-level interface for modern parallel architectures such as CPUs, GPU accelerators (via CUDA \cite{nvidia:cuda}) and distributed systems (via MPI \cite{mpi:3.1}).
Furthermore, TNL provides common building blocks for numerical solvers, including data structures and parallel algorithms for linear algebra, structured grids and unstructured meshes.
Using the data structures and algorithms from TNL is beneficial for performance, because they make it possible to avoid running expensive computations on the CPU and transferring large datasets between the system memory and accelerators over the PCI-E bus.
Instead, all expensive parts of the computational algorithm are executed on the GPU accelerators and the CPU is responsible only for the orchestration of the work and occasional sequential steps such as handling input and output.
Finally, the transition from a monolithic package to a more modular structure is being considered for the TNL project.
Introducing modules would allow clear separation of responsibilities for each part of the project, clear definition of dependencies on external projects and between the modules, and better overview of their completion status.
Consequently, TNL developers would have to track the stability of interfaces more carefully, for example using versioned releases that would also improve the usability of TNL in downstream projects.
This progress already started when the Python bindings for TNL were moved to a separate repository named PyTNL \cite{PyTNL:website}.
Candidates for new modules are features related to image processing and advanced numerical methods such as FEM, MHFEM, or FVM.
@inproceedings{Matsuzaki2010,
  title={Implementing fusion-equipped parallel skeletons by expression templates},
  booktitle={Implementation and Application of Functional Languages},
  editor={Morazán, Marco T. and Scholz, Sven-Bodo},
  publisher={Springer Berlin Heidelberg},
  address={Berlin, Heidelberg},
  year={2010},
  pages={72--89},
}