Commit d87d7cb3 authored by Jakub Klinkovský's avatar Jakub Klinkovský

conclusion - added summary of all chapters

parent e5a77f3c
+0 −3
@@ -259,9 +259,6 @@ For a given regular lattice and an unstructured mesh covering the domain $\Omega
The result of this decomposition procedure is illustrated in \cref{fig:lbm-mhfem:non-uniform decomposition}.
Overall, the decomposition algorithm optimizes the computational cost and memory requirements of each MPI rank at the cost of increased communication due to the larger number of lattice subdomains.

\later{Future work: problem of mapping MPI ranks to GPUs -- quadratic assignment problem, plus we need to get the weights (communication cost between each pair of GPUs) somehow.}


\section{Numerical analysis}
\label{sec:lbm-mhfem:numerical analysis}

+0 −2
@@ -623,8 +623,6 @@ The thread block size is selected such that $B_y$ is a multiple of 32 (i.e., the
%    \end{algsteps}
%\end{algorithm}

\later{Future work: multidimensional domain decomposition}

\section{Computational benchmark results}
\label{sec:lbm:results}

content/conclusion.tex

new file mode 100644
+103 −0
The following sections summarize the results presented in the previous chapters and outline potential directions for future research.

\section{Programming techniques for modern parallel architectures}

In \cref{chapter:programming and architectures}, we introduced various approaches for programming modern parallel architectures.
First, contemporary high-performance computing systems were briefly surveyed and the \C++ programming language was selected for the work presented in this thesis.
Then, we presented an overview of common parallel programming frameworks and illustrated the implementation of a parallel \emph{axpy} operation in each framework.
The Message Passing Interface for distributed computing was also briefly described.
Subsequently, we described several high-level parallel programming libraries with backend systems that provide performance portability across multiple hardware platforms and/or parallel programming frameworks.
Examples of the parallel \emph{axpy} operation in each library were included for comparison with the frameworks.
Finally, the Template Numerical Library was introduced and its features, design, and future work were described.
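For reference, the \emph{axpy} operation used as the common example throughout the chapter can be sketched as follows (a minimal illustration with OpenMP; function name and interface are illustrative, not taken from any of the compared frameworks):

```cpp
#include <cstddef>
#include <vector>

// axpy: y <- a*x + y, the elementary vector operation used to compare
// parallel programming frameworks. The OpenMP pragma parallelizes the
// loop when compiled with -fopenmp and is silently ignored otherwise.
void axpy(double a, const std::vector<double>& x, std::vector<double>& y)
{
    const std::size_t n = x.size();
    #pragma omp parallel for
    for (std::size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}
```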

\section{Data structures}

In \cref{chapter:data structures}, two efficient and configurable data structures from the Template Numerical Library were described: multidimensional arrays and unstructured meshes.
The section on multidimensional arrays amends the implementation described in the author's Master's thesis \cite{klinkovsky:2017thesis} and thus focuses on its extension for distributed computing.

The section on unstructured meshes provides a detailed description of the design and implementation of the data structure published in \cite{klinkovsky:2022meshes} and its later extension to polygonal and polyhedral meshes.
The data structure is designed around sparse matrix formats for the representation of incidence matrices, supports computations on CPUs as well as GPUs, and supports distributed computing via MPI.
Its efficiency was evaluated using several benchmark problems based on simple parallel algorithms.
Compared to the data structure available in the MOAB library, the primary benchmark using the TNL data structure is about $13\times$ faster for triangular meshes, $5\times$ faster for tetrahedral meshes, $10\times$ faster for polygonal meshes, and $6\times$ faster for polyhedral meshes.
However, for the alternative benchmark, which requires more information from the mesh data structure, the speedup factor rises to $130\times$ for tetrahedral meshes.
Furthermore, the results indicate good GPU utilization for all benchmarks and mesh types, including polygonal and polyhedral.
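The idea of representing mesh incidence matrices in a sparse matrix format can be sketched as follows (an illustrative CSR-like layout, not the actual TNL interface); the variable row lengths are what make the same storage scheme work for polygonal and polyhedral meshes:

```cpp
#include <cstddef>
#include <vector>

// Illustrative sketch: a cell-to-vertex incidence matrix stored in a
// CSR-like sparse format. Each "row" holds the vertex indices of one
// cell, so cells with different numbers of vertices (e.g. in polygonal
// meshes) are handled uniformly.
struct IncidenceMatrix
{
    std::vector<std::size_t> rowOffsets;    // size: numCells + 1
    std::vector<std::size_t> columnIndices; // vertex indices, row by row

    // number of vertices incident to the given cell
    std::size_t verticesOfCell(std::size_t cell) const
    {
        return rowOffsets[cell + 1] - rowOffsets[cell];
    }

    // global index of the local-th vertex of the given cell
    std::size_t vertexOfCell(std::size_t cell, std::size_t local) const
    {
        return columnIndices[rowOffsets[cell] + local];
    }
};
```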

\section{Solution of sparse linear systems}

In \cref{chapter:linear systems}, we presented a review of iterative methods and preconditioning techniques for the solution of sparse linear systems.
Additionally, we presented an overview of software packages implementing related algorithms and described corresponding features available in the TNL library.
Finally, we described details related to the implementations of distributed sparse matrices in the TNL and Hypre libraries.
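As a minimal illustration of the reviewed preconditioning techniques, the simplest one, the Jacobi (diagonal) preconditioner, amounts to a pointwise division of the residual by the matrix diagonal (an illustrative sketch, not the actual TNL or Hypre interface):

```cpp
#include <cstddef>
#include <vector>

// Jacobi preconditioner: z = M^{-1} r with M = diag(A), i.e. a
// pointwise division of the residual by the matrix diagonal.
// Trivially parallel, but often much weaker than e.g. algebraic
// multigrid preconditioners.
std::vector<double> applyJacobi(const std::vector<double>& diagonal,
                                const std::vector<double>& r)
{
    std::vector<double> z(r.size());
    for (std::size_t i = 0; i < r.size(); i++)
        z[i] = r[i] / diagonal[i];
    return z;
}
```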

\section{Mixed-hybrid finite element method}

In \cref{chapter:MHFEM}, we described a numerical scheme for the solution of a system of partial differential equations in a general coefficient form, called \emph{NumDwarf}.
The scheme is based on the mixed-hybrid finite element method (MHFEM) and the discontinuous Galerkin method for spatial discretization, the Euler method for temporal discretization, and the semi-implicit approach of the frozen coefficients method for the linearization in time.
The scheme was originally developed in \cite{fucik:2019NumDwarf} for multicomponent flow and transport phenomena in porous media.
The chapter builds on the paper \cite{fucik:2019NumDwarf} and the author's Master's thesis \cite{klinkovsky:2017thesis}, but includes more details and variants of the scheme, such as implicit upwind stabilization for advective terms.
The implementation of the scheme relies on the TNL library, especially the data structure for unstructured meshes and its parallel computing capabilities.

The chapter briefly introduces the mathematical model of incompressible two-phase flow in porous media and the generalized McWhorter--Sunada problem, which is used as a benchmark problem to analyze the accuracy and computational performance of the scheme.
The verification results show first order of accuracy in the $L_1$ and $L_2$ norms in 2D and 3D.
The benchmark was computed in several configurations with varying parameters, namely the linear system solver, hardware architecture, and programming framework for CPU parallelization.
The results highlight the importance of a good linear system solver, since the solution of linear systems takes most of the computational time.
The BiCGstab method with the BoomerAMG preconditioner from the Hypre library performs several times faster than the Jacobi-preconditioned BiCGstab method on CPU.
Both variants exhibit similar strong scalability with respect to the increasing number of CPU cores.
For GPU computations, the difference between the performance of the BoomerAMG and Jacobi preconditioners is not as significant, but the former still performs better and its advantage might be even more considerable for larger meshes.
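The reported order of accuracy is typically estimated from error norms on successively refined meshes via the experimental order of convergence; a minimal sketch (function name and interface are illustrative):

```cpp
#include <cmath>

// Experimental order of convergence (EOC) between two successive mesh
// refinements: eoc = log(e1/e2) / log(h1/h2), where e is the error in
// some norm (e.g. L1 or L2) and h the mesh size. A first-order scheme
// yields values close to 1.
double eoc(double e1, double h1, double e2, double h2)
{
    return std::log(e1 / e2) / std::log(h1 / h2);
}
```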

\section{Lattice Boltzmann method}

In \cref{chapter:LBM}, we described the implementation of the lattice Boltzmann method and performance optimizations necessary for its efficient use on distributed systems with GPU accelerators.
All components of the method are described with the objective of formulating the computational algorithm.
Two streaming schemes based on the A-B and A-A patterns are explained in detail and tested in a performance benchmark.
The optimizations include overlapping computation and communication, pipelining for operations in the distributed data synchronization, and using CUDA streams and MPI functions efficiently in the implementation.

A small computational benchmark was performed on several Nvidia GeForce and Nvidia Tesla GPUs to evaluate the difference between the A-B and A-A streaming patterns.
In almost all cases, the A-A pattern performed better or only slightly worse in terms of GLUPS, and its main advantage is the halved memory requirement for the storage of discrete distribution function values.
A larger computational benchmark was performed for the A-A pattern on the Karolina supercomputer using accelerated nodes with 8 Nvidia A100 GPUs each.
The results show good strong scalability up to 8 nodes (64 GPUs) using a $512 \times 512 \times 512$ lattice in single as well as double precision.
Two weak scaling studies with 1D and 3D domain expansion show almost ideal scalability up to 16 nodes (128 GPUs), the largest configuration tried in the benchmark.

An important limitation of the implemented solver is the one-dimensional domain decomposition; implementation of a multi-dimensional decomposition algorithm is planned for the future.
The performance of the solver in the strong scaling study is further limited by the non-functional GPUDirect technology on the supercomputer.
Another problem that may be important for modern supercomputers is the optimal mapping of MPI ranks to GPUs, which would consider non-uniform communication costs between each pair of GPUs.
This leads to the quadratic assignment problem, which is NP-hard, and the weights approximating communication costs would need to be measured experimentally or provided by a network-topology-aware hardware introspection tool.
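For small numbers of GPUs per node, the quadratic assignment problem can even be solved by exhaustive search; a minimal sketch (illustrative names, assuming the communication volumes and transfer costs are already known):

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <numeric>
#include <vector>

// Brute-force sketch of the quadratic assignment problem for mapping
// MPI ranks to GPUs: comm[i][j] is the communication volume between
// ranks i and j, cost[a][b] the transfer cost between GPUs a and b.
// Enumerating all n! permutations is feasible only for small n; the
// general problem is NP-hard.
double bestAssignmentCost(const std::vector<std::vector<double>>& comm,
                          const std::vector<std::vector<double>>& cost)
{
    const std::size_t n = comm.size();
    std::vector<std::size_t> perm(n);
    std::iota(perm.begin(), perm.end(), 0);  // identity permutation
    double best = std::numeric_limits<double>::infinity();
    do {
        double total = 0;
        for (std::size_t i = 0; i < n; i++)
            for (std::size_t j = 0; j < n; j++)
                total += comm[i][j] * cost[perm[i]][perm[j]];
        best = std::min(best, total);
    } while (std::next_permutation(perm.begin(), perm.end()));
    return best;
}
```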

\section{Coupled LBM-MHFEM computational approach}

In \cref{chapter:LBM-MHFEM}, we presented a coupled computational approach based on the combination of lattice Boltzmann and mixed-hybrid finite element methods for the solution of the Navier--Stokes equations coupled with a general system of advection--diffusion--reaction partial differential equations.
The work presented in this chapter explores new possibilities for efficient solution of various multiphysics problems using modern hardware architectures.

Numerical details are provided for the coupled computational algorithm, time adaptivity, and interpolation of the velocity field.
Thanks to the TNL library, the solver can utilize modern GPU-based high-performance computing systems.
For optimal utilization of computational resources, we designed a domain decomposition algorithm for the overlapped lattice and mesh, which makes it possible to optimize the computational cost and memory requirements of each MPI rank at the cost of increased communication due to the larger number of lattice subdomains.
The decomposition algorithm is essentially one-dimensional and its generalization to improve scalability on large supercomputers may be the subject of future research.

A simple benchmark problem based on a highly turbulent velocity field and a linear advection--diffusion equation with an analytical solution was designed to analyze the accuracy of the coupled numerical scheme.
The results show that the numerical schemes for the conservative and non-conservative forms of the advection--diffusion equation do not behave equivalently.
Thanks to a term that compensates for non-zero velocity divergence, the accuracy of the non-conservative scheme is better by about 10 orders of magnitude in the benchmark.
Furthermore, the benchmark shows small differences between linear and cubic interpolation schemes and between explicit and implicit upwind stabilization for advective terms.
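The distinction between the two forms can be sketched for a generic linear advection--diffusion equation (the notation here is illustrative, not taken verbatim from the chapter). Expanding the conservative form by the product rule,

```latex
% conservative form: the advective flux under the divergence
\begin{equation*}
  \partial_t c + \nabla \cdot (\vec{v}\, c) = \nabla \cdot (D \nabla c) ,
\end{equation*}
% equivalent non-conservative form obtained via the product rule
\begin{equation*}
  \partial_t c + \vec{v} \cdot \nabla c + c \, (\nabla \cdot \vec{v}) = \nabla \cdot (D \nabla c) ,
\end{equation*}
```

shows that the two forms coincide only for an exactly divergence-free velocity field, in which case the term $c \, (\nabla \cdot \vec{v})$ vanishes. Since the interpolated LBM velocity field is divergence-free only up to discretization error, retaining this term in the non-conservative scheme compensates for the residual divergence.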

\section{Mathematical modeling of vapor transport in air}

In \cref{chapter:vapor transport}, we used the coupled computational approach developed in the previous chapter to simulate vapor transport in the boundary layer above a partially saturated soil.
The chapter concludes the multidisciplinary work presented in this thesis, which was carried out in mutual collaboration between experimental and computational methodologies.

The mathematical model was validated with experimental data measured above a flat partially saturated soil layer featuring synthetic plants arranged in several configurations.
The experimental dataset used in this study was generated by \cite{trautz2017development,trautz2017role} and is publicly available in \cite{trautz:dataset}.
The model relies on experimental data for the specification of boundary conditions: the inflow velocity and humidity profiles and the average mass flux of water loss from the plants.
Based on the presented validation study, we can draw reasonable predictions about the flow and transport behavior inside the computational domain.

The performance of the coupled solver depends on the selected lattice and mesh sizes (i.e., spatial resolution) and the adaptively selected time steps.
The highest-resolution simulations, which compare best to the experimental data, require about 200~GiB of memory and \SI{15.25}{\hour} of computational time on 8 Nvidia Tesla A100 cards to simulate \SI{100}{\second} of physical time.
The simulations at lower resolutions are not as accurate, but require less memory and shorter computational time.
A strong scaling analysis was performed at a lower resolution, giving a parallel efficiency of 80\% on 8 Nvidia Tesla A100 cards.
Scalability problems that are likely to occur on large-scale supercomputers were not investigated due to the limited availability of computational resources.

The presented results suggest several key areas where future experimental efforts could be improved, allowing the analysis of this model's performance to be extended and further explored.
For example, extending measurements with flow characteristics in the transverse direction (e.g., $v_y$, RMS$_y$, $\overline{v'_x v'_y}$, $\overline{v'_y v'_z}$) would allow us to compare the turbulent kinetic energy and improve the fluctuating inflow velocity condition for the simulations.
Another possible improvement is to arrange measurements in horizontal profiles in regions behind the plants, which would allow us to study the convergence of the numerical method (i.e., the effect of mesh resolution) by comparing the horizontal locations of the vortical structures.
Last but not least, the applicability of the measured evaporative mass flux to the close spacing scenario EX-1 should be investigated.
Improving the methodology for measuring the evaporation from the plants would allow for prescribing more appropriate boundary conditions.

The presented simulator for vapor transport in air is only the first application of the coupled LBM-MHFEM approach and could be extended into a more general software tool capable of solving other physical phenomena such as non-isothermal flow, multicomponent flow, land-atmosphere interaction, etc.
There are many potential applications in combination with the experimental research, such as developing an efficient tool for a sensitivity analysis of measurements, supplementing sparse experimental datasets in regions where measurements would be too expensive or unfeasible, or predicting the behavior of the studied system in virtual scenarios.
+15 −1
@@ -10,6 +10,7 @@

\cleardoublepage
\chapter{Data Structures}
\label{chapter:data structures}

Efficient data structures play an important role in high-performance computing, because they determine where data are stored in the computer memory and how quickly they can be accessed.
Hence, data structures and algorithms often have to be designed together in order to utilize the most efficient access pattern on a given hardware architecture.
@@ -54,4 +55,17 @@ In this chapter, we present several data structures implemented in the Template
\addcontentsline{toc}{chapter}{Conclusion}
\label{sec:conclusion}

\inline{TODO}
% reset section counter and remove chapter number from it
\setcounter{section}{0}
\let\thesectionBackup\thesection
\renewcommand*{\thesection}{\arabic{section}}
% GOTCHA: hyperref uses the \theHsection counter
% https://tex.stackexchange.com/a/71174
\let\theHsectionBackup\theHsection
\renewcommand*{\theHsection}{conclusion.\the\value{section}}

\input{content/conclusion.tex}

% restore numbering
\let\thesection\thesectionBackup
\let\theHsection\theHsectionBackup
+1 −35
@@ -6,9 +6,8 @@ The experimental methodology was developed by Andrew Trautz and Tissa Illangasek
The chapter is organized as follows.
First, the motivation and introduction to the mathematical modeling of vapor transport in air is described in \cref{sec:WT:introduction}.
The following \cref{sec:WT:problem formulation} provides a general description of the experiments, mathematical model and boundary conditions.
Then \cref{sec:WT:computational methodology} gives specific details of the computational methodology using the coupled LBM-MHFEM solver and \cref{sec:WT:validation results} presents the validation results of our model.
Then, \cref{sec:WT:computational methodology} gives specific details of the computational methodology using the coupled LBM-MHFEM solver and the final \cref{sec:WT:validation results} presents the validation results of our model.
The model is compared both qualitatively and quantitatively to experimental data measured in three configurations resulting in different flow regimes.
Finally, the achieved results and future work are summarized in \cref{sec:WT:concusion}.

\section{Introduction}
\label{sec:WT:introduction}
@@ -754,36 +753,3 @@ In the cases EX-2 and EX-3 featuring different flow regimes, the graphs in \cref
    }
    \label{fig:plot1D:WT02_3:rh}
\end{figure}

\FloatBarrier

\section{Concluding remarks}
\label{sec:WT:concusion}

\inline{revise conclusion for the thesis}

In this paper, we presented an efficient computational method for vapor transport in the boundary layer above a partially saturated soil.
The solver is based on the combination of lattice Boltzmann and mixed-hybrid finite element methods and can utilize modern GPU-based high-performance computing systems.
The paper deals with mutual collaboration between experimental and computational methodologies.

The model was validated with experimental data measured above a flat partially saturated soil layer featuring synthetic plants arranged in several configurations.
The model relies on experimental data for the input for boundary conditions: the inflow velocity and humidity profiles and the average mass flux of water loss from the plants; experimental data used in this study were generated by \cite{trautz2017development,trautz2017role} and are publicly available in \cite{trautz:dataset}.

Based on the validation study presented in this paper, we can draw reasonable predictions about the flow and transport behavior inside the computational domain.
The performance of the coupled solver depends on the selected lattice and mesh sizes (i.e., spatial resolution) and the adaptively selected time steps.
The highest-resolution simulations presented in this paper, which compare the best to the experimental data, require about 200~GiB memory and \SI{15.25}{\hour} computational time on 8 Nvidia Tesla A100 cards to simulate \SI{100}{\second} of physical time.
The simulations in lower resolutions are not as accurate, but require less memory and shorter computational time compared to the highest resolution.
A strong scaling analysis was performed for a lower resolution giving a parallel efficiency of 80\% on 8 Nvidia Tesla A100 cards.
Scalability problems that are likely to occur on large-scale supercomputers (e.g., due to one-dimensional decomposition of the domain) were not investigated here due to the availability of computational resources.
The generalization of the domain decomposition procedure from \cref{sec:decomposition} to improve the scalability on large supercomputers may be subject of future research.

The results presented herein suggest several key areas where future experimental efforts could be improved, allowing the analysis of this model's performance to be extended and further explored.
For example, extending measurements with flow characteristics in the transverse direction (e.g., $v_y$, RMS$_y$, $\overline{v'_x v'_y}$, $\overline{v'_y v'_z}$) would allow us to compare the turbulent kinetic energy and improve the fluctuating inflow velocity condition for the simulations.
Another possible improvement is to arrange measurements in horizontal profiles in regions behind the plants, which would allow us to study the convergence of the numerical method (i.e, the effect of mesh resolution) by comparing the horizontal location of the vortical structures.
Last but not least, the applicability of the measured evaporative mass flux to the close spacing scenario EX-1 should be investigated.
Improving the methodology for measuring the evaporation from the plants would allow for prescribing more appropriate boundary conditions.

The work presented in this paper explores new possibilities in the efficient solution of various multiphysics problems using modern hardware architectures.
The developed model is based on the combination of LBM for fluid flow and MHFEM for a general system of advection-diffusion-reaction PDEs.
The simulator for vapor transport in air is just a first application that could be extended into a more general software tool capable of solving other physical phenomena such as non-isothermal flow, multicomponent flow, land-atmospheric interaction, etc.
There are many potential applications in combination with the experimental research, such as developing an efficient tool for a sensitivity analysis of measurements, supplementing sparse experimental datasets in regions where measurements would be too expensive or unfeasible, or predicting the behavior of the studied system in virtual scenarios.