Note that the implementation relies on the so-called \emph{CUDA-aware} support \cite{kraus:cuda-aware-mpi} from the MPI library in order to avoid explicitly copying the send and receive buffers in steps \ref{algitem:ndarray pipelining:step 1} and \ref{algitem:ndarray pipelining:step 3} to host memory.
However, the MPI library may still need to internally copy data between the device and host memory when handling the \ic{MPI_Isend} and \ic{MPI_Irecv} calls.
This was actually the case in all simulations performed for this thesis, because we did not have access to any system with functional NVIDIA GPUDirect technology \cite{nvidia:GPUDirect}.
Each copy operation started in steps \ref{algitem:ndarray pipelining:step 1} and \ref{algitem:ndarray pipelining:step 3} involves launching a CUDA kernel that is placed into the same high-priority CUDA stream as was previously used to process the corresponding lattice sites.
However, as mentioned in \cref{sec:distributed ndarray synchronization}, these copy operations can be avoided when the data is already stored in contiguous blocks of memory.
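For illustration, the essential pattern of such a synchronization step can be sketched as follows. This is a simplified fragment, not the solver's actual code: the buffer names, the packing and unpacking kernels, and the single-neighbor exchange are assumptions made for brevity. With a CUDA-aware MPI library, device pointers are passed directly to the MPI calls:

```cpp
// Simplified sketch of one halo-exchange step with CUDA-aware MPI.
// Assumed names: d_send_buf and d_recv_buf are device pointers and
// `stream` is the high-priority CUDA stream used for boundary sites.

pack_halo_kernel<<<grid, block, 0, stream>>>(d_send_buf /* ... */);
cudaStreamSynchronize(stream);  // packing must finish before MPI may read d_send_buf

MPI_Request reqs[2];
// Device pointers go directly to MPI; without GPUDirect, the MPI library
// may still stage the data through host memory internally.
MPI_Irecv(d_recv_buf, count, MPI_FLOAT, neighbor, tag, comm, &reqs[0]);
MPI_Isend(d_send_buf, count, MPI_FLOAT, neighbor, tag, comm, &reqs[1]);

// Meanwhile, interior lattice sites are processed in a separate stream,
// overlapping computation with communication.

MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
unpack_halo_kernel<<<grid, block, 0, stream>>>(d_recv_buf /* ... */);
```

When GPUDirect is functional, the same calls transfer data from device memory to the network interface without the internal host staging.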
Based on the layout of the four-dimensional array storing the values of discrete distribution functions and assuming the aforementioned 1D decomposition along the $x$-dimension, the CUDA thread block size for LBM computations is selected as $(1, B_y, B_z)$, where the parameters $B_y$ and $B_z$ are determined by \cref{alg:LBM:CUDA thread block size} based on the lattice sizes $N_y$ and $N_z$ to achieve coalesced memory accesses \cite{nvidia:cuda}.
The algorithm does not impose any constraints on $N_y$ and $N_z$; it assumes that the CUDA kernel checks, for each thread, whether it corresponds to a lattice site or is out of bounds.
The thread block size is selected such that $B_y$ is a multiple of 32 (i.e., the \emph{warp size} on contemporary NVIDIA GPUs \cite{nvidia:cuda}) and $B_y B_z \le\ic{max_threads}$, a limit that we set empirically as $\ic{max_threads}=256$ for a single-precision data type and $\ic{max_threads}=128$ for a double-precision data type.
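As an illustration of these constraints, the selection could be sketched in plain C++ as follows. This is a hypothetical sketch, not the algorithm from \cref{alg:LBM:CUDA thread block size}: the function name and the particular rounding strategy are assumptions, and only the stated constraints (warp-size multiple, thread limit) are taken from the text.

```cpp
#include <algorithm>

// Hypothetical sketch of CUDA thread block size selection for LBM.
// Constraints from the text: B_y is a multiple of the warp size (32)
// and B_y * B_z <= max_threads (256 single / 128 double precision).
struct BlockSize { int x, y, z; };

BlockSize select_block_size(int Ny, int Nz, int max_threads)
{
    const int warp = 32;
    // Round Ny up to a multiple of the warp size, capped at max_threads;
    // out-of-bounds threads are masked by a bounds check in the kernel.
    int By = std::min(((Ny + warp - 1) / warp) * warp, max_threads);
    // Fill the remaining thread capacity along the z-dimension.
    int Bz = std::max(1, std::min(Nz, max_threads / By));
    return {1, By, Bz};
}
```

For the lattice sizes used later in this section ($N_y = 512$), this sketch reproduces the reported block sizes $(1, 256, 1)$ in single precision and $(1, 128, 1)$ in double precision.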
\begin{algorithm}[CUDA thread block size selection for LBM]
\label{alg:LBM:CUDA thread block size}
% NOTE: the following is the old version with additional constraints
%
%The constraints are that $N_y$ must be a multiple of $B_y$, $N_z$ must be a multiple of $B_z$, and $B_y$ must be a multiple of 32 (i.e., the \emph{warp size} on contemporary NVIDIA GPUs \cite{nvidia:cuda}).
%The algorithm tries to maximize the number of threads in the block up to \ic{max_threads}, a limit that we set empirically as $\ic{max_threads} = 256$ for a single-precision data type and $\ic{max_threads} = 128$ for a double-precision data type.
%
%\begin{algorithm}[CUDA thread block size selection for LBM]
As the first result, we compare the performance of two streaming patterns in our implementation: the A-B pull scheme and the A-A pattern (see \cref{sec:lbm:streaming schemes}).
The benchmark was computed on a fixed domain with $L_x = L_y = L_z =\SI{0.25}{\metre}$, using single or double precision, and without MPI distribution (i.e., using one MPI rank).
The same simulation was computed on several NVIDIA GPU accelerators from the three latest architectures (Pascal, Volta/Turing, Ampere) and two product lines (GeForce, Tesla).
The solver was compiled with the \ic{nvcc} compiler version 11.8 using the flags \ic{-std=c++17 -O3 --use_fast_math} and the host compiler \ic{g++} version 11.3.0.
The results included in \cref{tab:lbm:AB vs AA} show that in all cases, the difference between the A-B pull scheme and the A-A pattern in terms of performance is negligible.
\begin{table}[tb]
\caption{
Performance comparison of the A-B pull scheme with the A-A streaming pattern in single and double precision on various NVIDIA GPU architectures.
The lattice size is $256\times256\times256$ in all cases.
}
\label{tab:lbm:AB vs AA}
The CUDA thread block size selected by \cref{alg:LBM:CUDA thread block size} is $(1, B_y, 1)$ for all lattice sizes used in this study, where $B_y =256$ for single precision and $B_y =128$ for double precision.
The performance of the LBM algorithm decreases when $N_y$ is not a multiple of $B_y$, due to inactive threads, compared to the optimal case where $N_y$ is a multiple of $B_y$.
Note that all results presented in this section were computed without utilizing the NVIDIA GPUDirect technology \cite{nvidia:GPUDirect}, which is still not fully functional on the Karolina supercomputer.
Hence, the effective bandwidth of inter-node communication was limited, because data could not be transferred from the GPU memory directly to the InfiniBand network interface and had to be buffered in the operating system memory instead.
Another problem is that in each computation, InfiniBand was used in a multi-rail configuration \cite{UCX:FAQ} with two ports, but these ports were shared by all ranks on a node while the other two InfiniBand ports remained completely unused.
We are not certain whether this is a hardware or a software problem, nor how it should be addressed.
Two streaming schemes based on the A-B and A-A patterns are explained in detail and tested in a performance benchmark.
The optimizations include overlapping computation and communication, pipelining for operations in the distributed data synchronization, and using CUDA streams and MPI functions efficiently in the implementation.
A small computational benchmark was performed on several NVIDIA GeForce and NVIDIA Tesla GPUs to evaluate the difference between the A-B and A-A streaming patterns.
In almost all cases, the A-A pattern performed better or only slightly worse in terms of GLUPS, but its main advantage is the halved memory requirement for the storage of discrete distribution function values.
A larger computational benchmark was performed for the A-A pattern on the Karolina supercomputer using accelerated nodes with 8 NVIDIA A100 GPUs each.
The results show good strong scalability up to 8 nodes (64 GPUs) using a $512\times512\times512$ lattice in single as well as double precision.
Two weak scaling studies with 1D and 3D domain expansion show almost ideal scalability up to 16 nodes (128 GPUs), the largest configuration tried in the benchmark.
Based on the presented validation study, we can draw reasonable predictions about the flow and transport behavior inside the computational domain.
The performance of the coupled solver depends on the selected lattice and mesh sizes (i.e., spatial resolution) and the adaptively selected time steps.
The highest-resolution simulations, which agree best with the experimental data, require about 200~GiB of memory and \SI{15.25}{\hour} of computational time on 8 NVIDIA Tesla A100 cards to simulate \SI{100}{\second} of physical time.
The simulations at lower resolutions are not as accurate, but they require less memory and shorter computational time than the highest-resolution runs.
A strong scaling analysis was performed at a lower resolution, giving a parallel efficiency of 80\% on 8 NVIDIA Tesla A100 cards.
Scalability problems that are likely to occur on large-scale supercomputers were not investigated due to the limited availability of computational resources.
The presented results suggest several key areas where future experimental efforts could be improved, allowing the analysis of this model's performance to be extended and further explored.
\addcontentsline{toc}{section}{State of the Art}
Contemporary supercomputers are based on several different hardware platforms and each has an associated parallel programming framework providing native performance.
For example, CUDA \cite{nvidia:cuda} targets NVIDIA GPU accelerators, ROCm \cite{amd:rocm} targets AMD GPU accelerators, and TBB \cite{oneTBB} targets multi-core processors.
In order to simplify parallel programming and provide performance portability, high-level libraries such as Thrust \cite{thrust:website}, Kokkos \cite{Kokkos:ecosystem}, or TNL \cite{oberhuber:2021tnl} are being developed.
Additionally, established scientific libraries such as Trilinos \cite{trilinos} or PETSc \cite{petsc-web-page} are being updated in order to support GPU accelerators via suitable interfaces.
However, this often requires a complete redesign of an algorithm due to hardware platforms having different requirements on optimal data layout.
Additionally, the lattice Boltzmann method (LBM) has become popular for turbulent flow simulations \cite{kang2013,geier2015cumulant,kumar2018,peng2018,wittmann2013,zakirov2021}.
While many of the aforementioned projects are open-source and thus can be freely modified, it is difficult to incorporate novel approaches and methods into extensive software packages, especially for external users.
Hence, a significant stream of innovation originates from small separate projects that gradually either evolve into larger projects or get incorporated into existing software.
This thesis builds on the author's previous work \cite{klinkovsky:2017thesis}, an implementation of the mixed-hybrid finite element method for multiphase compositional flow in porous media, which can utilize NVIDIA GPU accelerators.
Additionally, the work extends the LBM implementation that is developed at the Department of Mathematics, FNSPE CTU in Prague.
\section*{Research goals}
The performance and efficiency of the developed solvers was tested on the largest computing systems that were available to the author at that time.
However, many supercomputers in the world are much larger and the scaling of the presented solvers on such systems should be investigated.
Furthermore, hardware as well as software platforms powering contemporary supercomputers are continuously evolving.
The TNL library supports parallelization via OpenMP and CUDA for multi-core CPUs and NVIDIA GPUs, but the implementation of new backends (e.g., using HIP and SYCL) is needed to provide support for other hardware platforms.
Finally, the TNL library has proved to be an effective tool for the development of scalable and flexible numerical solvers.
Its development requires continuous effort in order to keep pace with these evolving platforms.
PARALUTION uses MPI \cite{mpi:3.1} for distributed computing and provides several back-ends for parallel execution: OpenMP \cite{OpenMP}, CUDA \cite{nvidia:cuda}, and OpenCL \cite{khronos:opencl}.
It was also ported to the ROCm \cite{amd:rocm} platform as rocALUTION \cite{rocALUTION}.
AmgX \cite{nvidia:amgx,naumov:2015AmgX} is an open-source library of GPU-accelerated algebraic multigrid and preconditioned iterative methods developed by NVIDIA.
Besides algebraic multigrid, it also provides standard Krylov subspace methods (CG, BiCGstab, GMRES) and several preconditioners/smoothers (block Jacobi, Gauss--Seidel, ILU, polynomial).
AmgX uses CUDA \cite{nvidia:cuda} for GPU acceleration and MPI \cite{mpi:3.1} for distributed computing.
The algebraic multigrid implementation is partially based on the Hypre library.
We used two benchmark problems to verify the efficiency of the implemented data structure and to compare it against an existing state-of-the-art library.
The benchmarks show that the performance of mesh-based algorithms depends on the configured data types, but not on the presence or omission of unneeded internal data structures, such as sparse matrices, tag arrays, etc.
Furthermore, the results of all benchmark problems demonstrate that computations on the GPU (NVIDIA Tesla V100) are significantly faster than the computations on the CPU (Intel Xeon Gold 6146).
In the case of the calculation of cell measures, the GPU implementation is up to 10 (2D) or 105 (3D) times faster than a sequential CPU implementation, and up to 1.2 (2D) or 9.5 (3D) times faster than a 12-threaded OpenMP implementation on the CPU.
The second benchmark, involving calculations of curve lengths or surface areas, has different limiting factors: the GPU implementation is up to 140 (2D) or 390 (3D) times faster than the sequential CPU implementation, and up to 13 (2D) or 37 (3D) times faster than the 12-threaded OpenMP implementation on the CPU.
In contrast, implementations of these two benchmarks using the MOAB~\cite{tautges:2004moab} library are available only for execution on the CPU, and they are several times slower than the CPU implementations based on the presented data structure.