Commit db5b3ea4 authored by Jakub Klinkovský

improved introduction

parent 350c16bb
Mathematical modeling of fluid dynamics has many ecological, medical, and industrial applications, and it is one of the central research topics investigated at the Department of Mathematics, FNSPE CTU in Prague, in collaboration with prominent domestic as well as foreign institutions.
In order to accurately model complex natural processes governing the behavior of fluids, it is often necessary to employ advanced numerical methods capable of treating multiscale and multiphysics cases.
Multiscale modeling refers to techniques that resolve fundamental physical processes at many different temporal and/or spatial scales.
On the other hand, multiphysics models comprise several parts describing specific aspects of a large system, such as thermal distribution in a flowing fluid.
Both factors bring additional challenges to the development of mathematical models as well as numerical methods that can be applied.

Accurate numerical simulations in high resolution are possible only on large computational clusters or supercomputers.
However, using high-performance computing facilities efficiently is non-trivial, as it requires careful management of data in the computer memory and appropriate division of the computations among all available processing units.
This is especially true when designing algorithms for systems with GPU accelerators, which provide significantly more processing units and more memory levels than traditional computational systems, so the specifics of the hardware architecture have to be considered in the software design.
Due to the diversity of parallel computing platforms, it is desirable for scientists to use established high-level libraries that provide a portable and easy-to-use interface for common operations.
However, it may be difficult to combine different packages and libraries, or even to gain sufficient overview of the available options.
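As a concrete illustration of what "careful management of data in the computer memory" means on such hardware, the following sketch contrasts two memory layouts for the same data. The names and the example itself are hypothetical (this is not TNL code); the point is that the structure-of-arrays layout lets consecutive processing units access consecutive memory addresses, which GPU architectures reward with coalesced memory transactions.

```cpp
#include <vector>
#include <cstddef>

// Illustrative sketch (hypothetical names, not a real library's API):
// the same particle data stored in two layouts.

// Array of structures: one particle's fields are adjacent in memory.
struct ParticleAoS { double x, y, z; };

// Structure of arrays: each field is stored contiguously across particles,
// so iteration i and iteration i+1 touch neighboring addresses.
struct ParticlesSoA {
    std::vector<double> x, y, z;
    explicit ParticlesSoA(std::size_t n) : x(n), y(n), z(n) {}
};

// A kernel-style loop over the SoA layout: on a GPU, one thread would take
// one value of i, and adjacent threads would read adjacent elements of x.
void shift(ParticlesSoA& p, double dx) {
    for (std::size_t i = 0; i < p.x.size(); ++i)
        p.x[i] += dx;
}
```

Which layout is preferable depends on the access pattern, which is exactly why such choices are best encapsulated in reusable, configurable data structures rather than hard-coded in each solver.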

Since the field of computational fluid dynamics includes many substantially different applications, the most important requirements imposed on the building blocks of numerical solvers are configurability and composability.
The former allows a piece of software, such as a data structure, to be adapted to a specific application.
The latter allows multiple existing pieces to be combined to address a new use case.
Similarly, it is desirable to facilitate the coupling of high-level tools and methods in order to provide more general solvers for the multiphysics problems that arise in practice.
With these two design aspects, low-level components can be used to develop tools and solvers, which are also configurable and composable on a higher level.
On the highest level, a multiphysics solver typically comprises a hierarchy of components that are coupled together and configured for the needs of a specific problem.
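The two design aspects can be made concrete with a schematic C++ sketch (the names are hypothetical and do not represent TNL's actual interface): a component is *configured* through template parameters and *composed* from smaller existing pieces, which are themselves configurable.

```cpp
#include <array>
#include <cstddef>

// Configurable building block: the value type and the space dimension
// are chosen per application (hypothetical names, not TNL's API).
template <typename Real, std::size_t Dim>
struct Point {
    std::array<Real, Dim> coords{};
};

// Composition: a cell reuses Point instead of re-implementing coordinate
// storage, and is itself configurable through the same template parameters.
template <typename Real, std::size_t Dim, std::size_t Vertices>
struct Cell {
    std::array<Point<Real, Dim>, Vertices> vertices{};
};

// A component configured for one specific use case: a 2D triangle
// with double-precision coordinates.
using Triangle2D = Cell<double, 2, 3>;
```

Because the configuration is resolved at compile time, the composed hierarchy carries no runtime overhead, which is the reason this style is common in high-performance C++ libraries.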

This thesis pursues topics from two levels of interest.
At one level, it deals with the development of fundamental building blocks, such as efficient and reusable data structures and parallel algorithms.
Specifically, the data structures described in this thesis make it possible to organize structured as well as unstructured data in numerical simulations according to the requirements for efficient utilization of modern supercomputers.
At the other level, these building blocks are used as fundamental ingredients for the development of advanced numerical solvers in computational fluid dynamics.
The solvers described in this thesis are based on two main numerical methods, namely the mixed-hybrid finite element method and the lattice Boltzmann method.
Both methods are first tested thoroughly in isolation, and then a coupled computational approach combining them is introduced.
Finally, the resulting coupled solver is applied in practice for the simulation of water vapor transport in turbulent air flow.

\section*{State of the art}
\addcontentsline{toc}{section}{State of the Art}

Numerous computational tools based on numerical methods such as finite volumes or finite elements are available for solving partial differential equations originating from mathematical modeling of various biological, environmental, or industrial problems.
In particular, software projects such as deal.II~\cite{bangerth:2007deal.II}, DUNE~\cite{bastian:2006DUNE}, OpenFOAM~\cite{jasak:2007openfoam}, TOUGH2~\cite{pruess:1999TOUGH2}, MFiX~\cite{syamlal:1993}, ANSYS Fluent~\cite{ansys-fluent:2009} or COMSOL Multiphysics~\cite{COMSOLMultiphysics1998} are suitable for simulations in the field of computational fluid dynamics.
All the aforementioned projects provide some parallel computing capabilities.
While efficient implementations of the finite volume and finite element methods for GPU accelerators are available \cite{Cecka2011,Fu2014,bauer:2016,Castro2010,Castro2011,Zhang2023}, the aforementioned projects provide only limited or no support for GPU-accelerated computations.
Additionally, the lattice Boltzmann method (LBM) has become popular for turbulent flow simulations \cite{kang2013,geier2015cumulant,kumar2018,peng2018,wittmann2013,zakirov2021}.
Compared to traditional numerical approaches such as the finite volume or finite element methods, the LBM algorithm is simpler to parallelize, and most computational software that employs LBM also supports computations on GPU accelerators.
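The reason LBM parallelizes so naturally can be seen in its collision step, sketched below in illustrative form (this is not the in-house code): every lattice site is relaxed toward its local equilibrium independently of all other sites, so the loop body can be assigned to one GPU thread per site with no synchronization or data dependencies.

```cpp
#include <vector>
#include <cstddef>

// Minimal sketch of the BGK collision step of the lattice Boltzmann method.
// f   : distribution values, one entry per (site, direction) pair
// feq : the corresponding local equilibrium values
// omega : relaxation rate
// Each iteration reads and writes only its own entry, so the loop is
// embarrassingly parallel (one thread per entry on a GPU).
void collide(std::vector<double>& f, const std::vector<double>& feq,
             double omega) {
    for (std::size_t i = 0; i < f.size(); ++i)
        f[i] += omega * (feq[i] - f[i]);   // purely local BGK relaxation
}
```

The subsequent streaming step only copies each distribution to a fixed neighboring site, so the whole time step maps onto the GPU with a regular, predictable memory access pattern.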

While many of the aforementioned projects are open-source and thus can be freely modified, it is difficult to incorporate novel approaches and methods into extensive software packages, especially for external users.
Hence, a significant stream of innovation originates from small separate projects that gradually either evolve into larger projects or are incorporated into existing software.
Many such research projects were started separately at the Department of Mathematics, FNSPE CTU in Prague, including an in-house code implementing the lattice Boltzmann method, and the author's previous work \cite{klinkovsky:2017thesis}, an implementation of the mixed-hybrid finite element method for multiphase compositional flow in porous media.
The work described in this thesis integrates, extends and generalizes several such components.

\section*{Research goals}
\addcontentsline{toc}{section}{Research Goals}
The thesis presents the following novel results:
        \textbf{Scalable implementation of the lattice Boltzmann method (LBM) for supercomputers based on GPU accelerators.}
        The implementation is based on a distributed multidimensional array data structure and an MPI synchronizer for distributed data, which are implemented in the TNL library.
        Strong scaling as well as weak scaling studies were performed on the Karolina supercomputer \cite{it4i:karolina}.
        The lattice Boltzmann method code is developed by the research group at the Department of Mathematics, FNSPE CTU in Prague and the author's contributions are included in the publications \cite{fucik2019,fucik2020,fucik:lbmat,eichler2022}.
    \item
        \textbf{Coupled computational approach based on the combination of lattice Boltzmann method and mixed-hybrid finite element method.}
        We consider a model based on the Navier--Stokes equations (solved by LBM) coupled with a linear advection--diffusion equation (discretized using MHFEM) and analyze properties of the numerical scheme and performance of its implementation.
Finally, the TNL library has proved to be an effective tool for the development of the numerical solvers presented in this thesis.
Its development requires continuous effort in order to keep pace with the evolving hardware and software landscape.
The most important general directions of future development, as viewed by the author, are interoperability with other libraries, improving the modular structure of the project, and using formal \emph{concepts} in the \CPPtwenty standard to improve type-checking, documentation, and compiler diagnostics.

\section*{System of notation}
\addcontentsline{toc}{section}{System of notation}

This section summarizes common notation used in this thesis.
Each vector quantity is typeset using a bold italic symbol (e.g.,~$\vec v$).
In three dimensions, vector components are denoted by coordinate subscripts, $\vec v = [v_x, v_y, v_z]^T$, where $v_x \equiv v_1$, $v_y \equiv v_2$, and $v_z \equiv v_3$.
Each matrix or tensor quantity is typeset using a bold upright symbol (e.g.,~$\matrix A$).

@Article{Cecka2011,
  author    = {Cecka, Cris and Lew, Adrian J. and Darve, Eric},
  journal   = {International Journal for Numerical Methods in Engineering},
  title     = {Assembly of finite element methods on graphics processors},
  year      = {2011},
  issn      = {0029-5981},
  month     = feb,
  number    = {5},
  pages     = {640--669},
  volume    = {85},
  abstract  = {Abstract Recently, graphics processing units (GPUs) have had great success in accelerating many numerical computations. We present their application to computations on unstructured meshes such as those in finite element methods. Multiple approaches in assembling and solving sparse linear systems with NVIDIA GPUs and the Compute Unified Device Architecture (CUDA) are created and analyzed. Multiple strategies for efficient use of global, shared, and local memory, methods to achieve memory coalescing, and optimal choice of parameters are introduced. We find that with appropriate preprocessing and arrangement of support data, the GPU coprocessor using single-precision arithmetic achieves speedups of 30 or more in comparison to a well optimized double-precision single core implementation. We also find that the optimal assembly strategy depends on the order of polynomials used in the finite element discretization.},
  doi       = {10.1002/nme.2989},
  publisher = {John Wiley \& Sons, Ltd.},
}

@Article{Fu2014,
  author    = {Fu, Zhisong and James Lewis, T. and Kirby, Robert M. and Whitaker, Ross T.},
  journal   = {Journal of Computational and Applied Mathematics},
  title     = {Architecting the finite element method pipeline for the {GPU}},
  year      = {2014},
  issn      = {0377-0427},
  pages     = {195--211},
  volume    = {257},
  abstract  = {The finite element method (FEM) is a widely employed numerical technique for approximating the solution of partial differential equations (PDEs) in various science and engineering applications. Many of these applications benefit from fast execution of the FEM pipeline. One way to accelerate the FEM pipeline is by exploiting advances in modern computational hardware, such as the many-core streaming processors like the graphical processing unit (GPU). In this paper, we present the algorithms and data-structures necessary to move the entire FEM pipeline to the GPU. First we propose an efficient GPU-based algorithm to generate local element information and to assemble the global linear system associated with the FEM discretization of an elliptic PDE. To solve the corresponding linear system efficiently on the GPU, we implement a conjugate gradient method preconditioned with a geometry-informed algebraic multigrid (AMG) method preconditioner. We propose a new fine-grained parallelism strategy, a corresponding multigrid cycling stage and efficient data mapping to the many-core architecture of GPU. Comparison of our on-GPU assembly versus a traditional serial implementation on the CPU achieves up to an 87× speedup. Focusing on the linear system solver alone, we achieve a speedup of up to 51× versus use of a comparable state-of-the-art serial CPU linear system solver. Furthermore, the method compares favorably with other GPU-based, sparse, linear solvers.},
  doi       = {10.1016/j.cam.2013.09.001},
  publisher = {Elsevier},
}

@Article{Zhang2023,
  author    = {Zhang, Xi and Guo, Xiaohu and Weng, Yue and Zhang, Xianwei and Lu, Yutong and Zhao, Zhong},
  journal   = {Future Generation Computer Systems},
  title     = {Hybrid {MPI} and {CUDA} paralleled finite volume unstructured {CFD} simulations on a multi-{GPU} system},
  year      = {2023},
  issn      = {0167-739X},
  pages     = {1--16},
  volume    = {139},
  abstract  = {Porting unstructured Computational Fluid Dynamics (CFD) analysis of compressible flow to Graphics Processing Units (GPUs) confronts two difficulties. Firstly, non-coalescing access to the GPU’s global memory is induced by indirect data access leading to performance loss. Secondly, data exchange among multi-GPU is complex due to data communication between processes and transfer between host and device, which degrades scalability. For increasing data locality on unstructured finite volume GPU simulations for compressible flow, we perform some optimizations, including cell and face renumbering, data dependence resolving, nested loops split, and loop mode adjustment. Then, a hybrid MPI-CUDA parallel framework with packing and unpacking exchange data on GPU is established for multi-GPU computing. Finally, after optimizations, the performance of the whole application on a GPU is increased by around 50\%. Simulations of ONERA M6 cases on a single GPU (Nvidia Tesla V100) can achieve an average of 13.4 speedup compared to those on 28 CPU cores (Intel Xeon Gold 6132). On the baseline of 2 GPUs, strong scaling results show a parallel efficiency of 42\% on 200 GPUs, while weak scaling tests give a parallel efficiency of 82.4\% up to 200 GPUs.},
  doi       = {10.1016/j.future.2022.09.005},
  publisher = {Elsevier},
}

@Article{Castro2011,
  author    = {Castro, Manuel J. and Ortega, Sergio and de la Asunción, Marc and Mantas, José M. and Gallardo, José M.},
  journal   = {Comptes Rendus Mécanique},
  title     = {{GPU} computing for shallow water flow simulation based on finite volume schemes},
  year      = {2011},
  issn      = {1631-0721},
  number    = {2},
  pages     = {165--184},
  volume    = {339},
  abstract  = {This article is a review of the work that we are carrying out to efficiently simulate shallow water flows. In this paper, we focus on the efficient implementation of path-conservative Roe type high-order finite volume schemes to simulate shallow flows that are supposed to be governed by the one-layer or two-layer shallow water systems, formulated under the form of a conservation law with source terms. The implementation of the scheme is carried out on Graphics Processing Units (GPUs), thus achieving a substantial improvement of the speedup with respect to normal CPUs. Finally, some numerical experiments are presented.},
  doi       = {10.1016/j.crme.2010.12.004},
  publisher = {Elsevier},
}

@Article{Castro2010,
  author    = {Castro, Manuel J. and Ortega, Sergio and de la Asunción, Marc and Mantas, José M.},
  journal   = {SeMA Journal},
  title     = {On the benefits of using {GPUs} to simulate shallow flows with finite volume schemes},
  year      = {2010},
  issn      = {2254-3902},
  number    = {1},
  pages     = {27--44},
  volume    = {50},
  abstract  = {In this paper, we focus on the efficient implementation of path conservative Roe type high order finite volume schemes to simulate shallow flows. The motion of a layer of homogeneous non-viscous fluid is supposed to be governed by the shallow-water system, formulated under the form of a conservation law with source terms. The implementation of the scheme is carried out on Graphics Processing Units (GPUs), thus achieving a substantial improvement of the speedup with respect to normal CPUs. Finally, some numerical experiments are presented.},
  doi       = {10.1007/BF03322540},
  publisher = {Springer},
}

@Comment{jabref-meta: databaseType:bibtex;}

@Comment{jabref-meta: protectedFlag:true;}