The world entered the exascale era of computing in 2022, when the Frontier supercomputer \cite{enwiki:Frontier} was installed with a performance of \SI{1.102}{exaFLOPS} measured in the LINPACK benchmark suite \cite{top500:Frontier,top500:june2022}.
This thesis originated in the pre-exascale era of computing and does not target the world's fastest supercomputers.
The fastest computing system available to the author during his work on the thesis was the Karolina supercomputer \cite{it4i:karolina}, which was ranked 85th in the TOP500 list in November 2022 as the fastest supercomputer in the Czech Republic, with a performance of \SI{6.75}{petaFLOPS} measured in the LINPACK benchmark suite \cite{top500:Karolina}.
Since 2007 \cite{nvidia:cuda1.0,Huang2008}, graphics processing units (GPUs) have been increasingly used as general-purpose computing accelerators.
Over time, they evolved into powerful and efficient massively parallel devices that drive the computational performance of contemporary supercomputers thanks to their energy- and cost-efficiency \cite{Cebrin2012,Wu2014,Mittal2014,Anzt2015,Bridges2016,Qasaimeh2019}.
As of November 2022, 7 of the 10 most powerful supercomputers according to the TOP500 list are based on GPU accelerators \cite{top500:november2022}.
Parallel compute accelerators such as GPUs are based on a conceptually different hardware architecture than traditional processors based on the x86/x86-64 architectures.
Much has been written about the characteristics and evolution of these platforms \cite{Huang2008,Aamodt2018,Bridges2016}.
This thesis does not aim to repeat that summary; nevertheless, we need to highlight the main GPU hardware features:
\begin{itemize}
\item high number of simpler compute units (cores),
\item smaller caches and orientation to data-parallel applications,
\item groups of compute units and stacks of high-bandwidth global memory organized in scalable hierarchies.
\end{itemize}
Overall, GPU accelerators are advantageous for compute-bound as well as memory-bound data parallel applications.
To make the collection of individual accelerators and compute nodes scalable to the size of supercomputers, fast and scalable interconnections between the individual units are necessary.
Technologies such as NVLink and NVSwitch \cite{nvidia:2020A100,Li2020} allow for high-throughput and low-latency communication between GPUs in a single node and inter-node communication typically relies on switched fabric network interconnections.
State-of-the-art solutions for high-performance computing provide transfer speeds up to \SI{100}{Gbit\per\second} per link \cite{enwiki:InfiniBand}, latency around \SI{0.5}{\micro\second}~\cite{enwiki:InfiniBand}, remote direct memory access (RDMA) capabilities to minimize CPU overhead \cite{Potluri2013,Li2020}, and acceleration of collective communication operations \cite{Graham2010,Schneider2013}.
Compute nodes can be organized in various network topologies, such as fat tree or dragonfly, with multiple levels of network switches; links on the higher levels can be aggregated in order to increase the overall bandwidth of the network.
\inline{this thesis and GPUs (based on the above, GPUs are necessary for high-performance computing on modern systems; it should be possible to "upscale" most of the algorithms and data structures developed to larger computing systems thanks to similarity between GPU architectures)}
\inline{high-level parallelisation approaches such as OpenMP or OpenACC cannot compete with native CUDA in terms of performance~\cite{balogh:2017comparison}}
\section{Low-level overview}
\subsection{OpenMP and thread support in STL (multicore CPUs)}
\inline{describe streams and ideally generalize them somehow in TNL -- needed for LBM optimizations}
The open-source TNL library \cite{oberhuber:2021tnl} simplifies parallelization and distributed computing on GPU clusters.
TNL natively supports and provides a unified high-level interface for modern parallel architectures such as CPUs, GPU accelerators (via CUDA \cite{nvidia:cuda}) and distributed systems (via MPI \cite{mpi:3.1}).
Furthermore, TNL provides common building blocks for numerical solvers, including data structures and parallel algorithms for linear algebra, structured grids and unstructured meshes.
Using the data structures and algorithms from TNL is beneficial for performance, because they make it possible to avoid running expensive computations on the CPU and transferring large datasets between the system memory and accelerators over the PCI Express bus.
Instead, all expensive parts of the computational algorithm are executed on the GPU accelerators and the CPU is responsible only for the orchestration of the work and occasional sequential steps such as handling input and output.
@InProceedings{Huang2008,
author={Huang, Q. and Huang, Z. and Werstein, P. and Purvis, M.},
booktitle={Ninth International Conference on Parallel and Distributed Computing, Applications and Technologies},
title={{GPU} as a general purpose computing resource},
year={2008},
pages={151--158},
publisher={IEEE},
doi={10.1109/PDCAT.2008.38},
issn={2379-5352},
}
@InProceedings{Cebrin2012,
author={Cebrián, J. M. and Guerrero, G. D. and García, J. M.},
booktitle={2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops \& PhD Forum},
title={Energy efficiency analysis of {GPUs}},
year={2012},
pages={1014--1022},
publisher={IEEE},
doi={10.1109/IPDPSW.2012.124},
}
@InProceedings{Anzt2015,
author={Anzt, Hartwig and Tomov, Stanimire and Dongarra, Jack},
booktitle={Proceedings of the Sixth International Workshop on Programming Models and Applications for Multicores and Manycores},
title={Energy efficiency and performance frontiers for sparse computations on {GPU} supercomputers},
year={2015},
address={New York, NY, USA},
pages={1--10},
publisher={Association for Computing Machinery},
series={PMAM '15},
abstract={In this paper we unveil some energy efficiency and performance frontiers for sparse computations on GPU-based supercomputers. To do this, we consider state-of-the-art implementations of the sparse matrix-vector (SpMV) product in libraries like cuSPARSE, MKL, and MAGMA, and their use in the LOBPCG eigen-solver. LOBPCG is chosen as a benchmark for this study as it combines an interesting mix of sparse and dense linear algebra operations with potential for hardware-aware optimizations. Most notably, LOBPCG includes a blocking technique that is a common performance optimization for many applications. In particular, multiple memory-bound SpMV operations are blocked into a SpM-matrix product (SpMM), that achieves significantly higher performance than a sequence of SpMVs. We provide details about the GPU kernels we use for the SpMV, SpMM, and the LOBPCG implementation design, and study performance and energy consumption compared to CPU solutions. While a typical sparse computation like the SpMV reaches only a fraction of the peak of current GPUs, we show that the SpMM achieves up to a 6x performance improvement over the GPU's SpMV, and the GPU-accelerated LOBPCG based on this kernel is 3 to 5x faster than multicore CPUs with the same power draw, e.g., a K40 GPU vs. two Sandy Bridge CPUs (16 cores). In practice though, we show that currently available CPU implementations are much slower due to missed optimization opportunities. These performance results translate to similar improvements in energy consumption, and are indicative of today's frontiers in energy efficiency and performance for sparse computations on supercomputers.},
doi={10.1145/2712386.2712387},
isbn={9781450334044},
location={San Francisco, California},
numpages={10},
}
@InProceedings{Wu2014,
author={Wu, Q. and Ha, Y. and Kumar, A. and Luo, S. and Li, A. and Mohamed, S.},
booktitle={2014 International Symposium on Integrated Circuits (ISIC)},
title={A heterogeneous platform with {GPU} and {FPGA} for power efficient high performance computing},
year={2014},
pages={220--223},
publisher={IEEE},
doi={10.1109/ISICIR.2014.7029447},
issn={2325-0631},
}
@InProceedings{Qasaimeh2019,
author={Qasaimeh, M. and Denolf, K. and Lo, J. and Vissers, K. and Zambreno, J. and Jones, P. H.},
booktitle={2019 IEEE International Conference on Embedded Software and Systems (ICESS)},
title={Comparing energy efficiency of {CPU}, {GPU} and {FPGA} implementations for vision kernels},
year={2019},
pages={1--8},
publisher={IEEE},
doi={10.1109/ICESS.2019.8782524},
}
@Article{Mittal2014,
author={Mittal, Sparsh and Vetter, Jeffrey S.},
journal={ACM Computing Surveys},
title={A survey of methods for analyzing and improving {GPU} energy efficiency},
year={2014},
issn={0360-0300},
month={aug},
number={2},
volume={47},
abstract={Recent years have witnessed phenomenal growth in the computational capabilities and applications of GPUs. However, this trend has also led to a dramatic increase in their power consumption. This article surveys research works on analyzing and improving energy efficiency of GPUs. It also provides a classification of these techniques on the basis of their main research idea. Further, it attempts to synthesize research works that compare the energy efficiency of GPUs with other computing systems (e.g., FPGAs and CPUs). The aim of this survey is to provide researchers with knowledge of the state of the art in GPU power management and motivate them to architect highly energy-efficient GPUs of tomorrow.},
address={New York, NY, USA},
articleno={19},
doi={10.1145/2636342},
issue_date={January 2015},
numpages={23},
publisher={Association for Computing Machinery},
}
@Book{Aamodt2018,
author={Aamodt, Tor M. and Fung, Wilson Wai Lun and Rogers, Timothy G.},
title={General-Purpose Graphics Processor Architectures},
year={2018},
publisher={Morgan \& Claypool Publishers},
series={Synthesis Lectures on Computer Architecture},
doi={10.2200/S00848ED1V01Y201804CAC044},
issn={1935-3235},
pages={1--140},
}
@Article{Bridges2016,
author={Bridges, Robert A. and Imam, Neena and Mintz, Tiffany M.},
journal={ACM Computing Surveys},
title={Understanding {GPU} power: A survey of profiling, modeling, and simulation methods},
year={2016},
issn={0360-0300},
month={sep},
number={3},
volume={49},
abstract={Modern graphics processing units (GPUs) have complex architectures that admit exceptional performance and energy efficiency for high-throughput applications. Although GPUs consume large amounts of power, their use for high-throughput applications facilitate state-of-the-art energy efficiency and performance. Consequently, continued development relies on understanding their power consumption. This work is a survey of GPU power modeling and profiling methods with increased detail on noteworthy efforts. As direct measurement of GPU power is necessary for model evaluation and parameter initiation, internal and external power sensors are discussed. Hardware counters, which are low-level tallies of hardware events, share strong correlation to power use and performance. Statistical correlation between power and performance counters has yielded worthwhile GPU power models, yet the complexity inherent to GPU architectures presents new hurdles for power modeling. Developments and challenges of counter-based GPU power modeling are discussed. Often building on the counter-based models, research efforts for GPU power simulation, which make power predictions from input code and hardware knowledge, provide opportunities for optimization in programming or architectural design. Noteworthy strides in power simulations for GPUs are included along with their performance or functional simulator counterparts when appropriate. Last, possible directions for future research are discussed.},
address={New York, NY, USA},
articleno={41},
doi={10.1145/2962131},
issue_date={September 2017},
numpages={27},
publisher={Association for Computing Machinery},
}
@Online{enwiki:InfiniBand,
author={{Wikipedia contributors}},
title={InfiniBand --- {Wikipedia}{,} The Free Encyclopedia},
url={https://en.wikipedia.org/wiki/InfiniBand},
}
@Article{Li2020,
author={Li, A. and Song, S. L. and Chen, J. and Li, J. and Liu, X. and Tallent, N. R. and Barker, K. J.},
journal={IEEE Transactions on Parallel and Distributed Systems},
title={Evaluating modern {GPU} interconnect: {PCIe}, {NVLink}, {NV-SLI}, {NVSwitch} and {GPUDirect}},
year={2020},
issn={1558-2183},
number={1},
pages={94--110},
volume={31},
doi={10.1109/TPDS.2019.2928289},
}
@InProceedings{Potluri2013,
author={Potluri, S. and Hamidouche, K. and Venkatesh, A. and Bureddy, D. and Panda, D. K.},
booktitle={2013 42nd International Conference on Parallel Processing},
title={Efficient inter-node {MPI} communication using {GPUDirect} {RDMA} for {InfiniBand} clusters with {NVIDIA} {GPUs}},
year={2013},
pages={80--89},
publisher={IEEE},
doi={10.1109/ICPP.2013.17},
issn={2332-5690},
}
@InProceedings{Graham2010,
author={Graham, R. L. and Poole, S. and Shamis, P. and Bloch, G. and Bloch, N. and Chapman, H. and Kagan, M. and Shahar, A. and Rabinovitz, I. and Shainer, G.},
booktitle={2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing},
title={{ConnectX}-2 {InfiniBand} management queues: first investigation of the new support for network offloaded collective operations},
year={2010},
pages={53--62},
publisher={IEEE},
doi={10.1109/CCGRID.2010.9},
}
@InProceedings{Schneider2013,
author={Schneider, T. and Hoefler, T. and Grant, R. E. and Barrett, B. W. and Brightwell, R.},
booktitle={2013 42nd International Conference on Parallel Processing},
title={Protocols for fully offloaded collective operations on accelerated network adapters},