Commit 43775289 authored by Jakub Klinkovský

working on the MHFEM section - reorganized tables

- tables include results only on the finest meshes
- different grouping - OpenMP and MPI in the same table, GPU results in
  a separate table
- recomputed benchmarks on the finest 3D mesh with 1-24 cores
parent 6d2092bb
+19 −69
@@ -58,7 +58,8 @@ In this section, we present the results of the generalized McWhorter--Sunada pro

\subsubsection{Verification results}

For the purpose of this thesis, the computational domain, boundary conditions, and physical parameters were chosen identically to \cite{fucik:2019NumDwarf}, and also the meshes listed in \cref{tab:meshes} are the same as the triangular and tetrahedral meshes used for CPU computations in the original paper.
For the purpose of this thesis, the computational domain, boundary conditions, and physical parameters were chosen identically to \cite{fucik:2019NumDwarf}.
The simulations were computed on the meshes listed in \cref{tab:meshes}, which are the same as the triangular and tetrahedral meshes used in \cite{fucik:2019NumDwarf}.

\inline{add EOC tables}

@@ -73,6 +74,7 @@ The secondary quantities of interest for the comparison of sequential computatio
\end{equation*}
In the case of computations on the GPU, introducing a quantity similar to $E\!f\!f$ does not make sense, because GPU cores are not independent in the way that CPU cores are.
Hence, the only comparable quantity is the GPU speed-up $GSp_\ell$ defined as the ratio between the computational times on the CPU using $\ell$ cores and on the GPU.
\todo{GPU speed-up is currently not shown in the tables}
Similarly, parallel computations using multiple GPUs can be compared by means of the speed-up $Sp_\ell$ defined as the ratio between computational times using 1 and $\ell$ GPUs.
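As a worked illustration of the definition of $Sp_\ell$, consider the computational times on the mesh 3D$^\triangle_5$ reported in \cref{tab:mhfem:comptimes:GPU} for 1 and 4 GPUs:
\begin{equation*}
    Sp_4 = \frac{CT_1}{CT_4} = \frac{2654.8}{793.3} \approx 3.35,
    \qquad
    \frac{Sp_4}{4} \approx 0.84.
\end{equation*}
Here $CT_\ell$ is a shorthand, introduced only for this illustration, for the computational time using $\ell$ GPUs.
The second ratio assumes that the efficiency reported for multi-GPU computations is the speed-up divided by the number of GPUs, which is consistent with the tabulated values of 3.3 and 0.84.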

The computational time of the generalized McWhorter--Sunada problem is governed by the resolution of sparse linear systems rather than operations involving the unstructured mesh.
@@ -92,50 +94,29 @@ When the CPU execution goes from 6 to 12 cores on the finest 2D mesh in \cref{ta
This phenomenon may be caused by an increased cache size, since each core has its own L2 cache of size 1 MiB (see \cref{tab:hardware}) that cannot be utilized by other cores.
In the case of problems with memory requirements comparable to the cache size, such as the aforementioned computation on the mesh 2D$^\triangle_5$, increasing the number of cores may improve performance more than proportionally to the number of cores, since more data may be readily available in the cache and accesses to the system memory may be avoided.
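This can be checked directly on the reported times.
As a sketch, assuming that the MPI times for 6 and 12 cores on the mesh 2D$^\triangle_5$ are those listed in \cref{tab:mhfem:comptimes:CPU 2D}, and writing $CT_\ell$ (a shorthand used only here) for the computational time on $\ell$ cores, we get
\begin{equation*}
    \frac{CT_6}{CT_{12}} = \frac{2506.0}{1096.8} \approx 2.28 > \frac{12}{6} = 2,
\end{equation*}
while the combined L2 cache grows from $6 \times 1\,\text{MiB}$ to $12 \times 1\,\text{MiB}$.
The corresponding OpenMP times, $2294.6$ and $1296.2$, give a ratio of only about $1.77$, so under this assumption the effect is visible in the MPI results only.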

\begin{table}[!tb]
    \caption{
        Comparison of computational times $CT \, [\text{s}]$, parallel efficiency $E\!f\!f$ of the OpenMP-based CPU computations, and GPU speed-up $GSp_\ell$ for the generalized McWhorter--Sunada problem in 2D.
    }
    \label{tab:comptimes:mcwh2d_omp}
    \centering
    \scalebox{0.84}{
        \input{./data/mcwhdd/comptimes_openmp_2D.tex}
    }
\end{table}

\begin{table}[!tb]
    \caption{
        Comparison of computational times $CT \, [\text{s}]$, parallel efficiency $E\!f\!f$ of the MPI-based CPU computations, and GPU speed-up $GSp_\ell$ for the generalized McWhorter--Sunada problem in 2D.
    }
    \label{tab:comptimes:mcwh2d_mpi}
    \centering
    \scalebox{0.85}{
        \input{./data/mcwhdd/comptimes_mpi_2D.tex}
    }
\end{table}
CPU computations distributed across multiple nodes using MPI are compared in \cref{tab:comptimes:mcwh3d_cpu_nodes}.
This strong-scaling performance study was performed only on the finest tetrahedral mesh 3D$^\triangle_5$.
The computational cluster used for the computations consists of 20 dual-processor nodes containing CPUs listed in \cref{tab:hardware}, but only 16 nodes at most could be employed in one MPI computation.
The speed-up $Sp$ is calculated with respect to the computation using 12 cores, which was included in \cref{tab:comptimes:mcwh3d_mpi} already.
It can be noticed that the speed-up is often higher than the number of CPUs, which may again be attributed to the total cache size increasing while the problem size stays constant in the strong-scaling study.
Comparing the computational times on the mesh 3D$^\triangle_5$ from \cref{tab:comptimes:mcwh3d_gpu} with \cref{tab:comptimes:mcwh3d_cpu_nodes}, it can be seen that using 1 GPU leads to a shorter computational time than using 8 CPUs (4 nodes), and that at least 32 CPUs (16 nodes) are necessary to achieve a shorter computational time than with 4 GPUs.

\begin{table}[!tb]
    \caption{
        Comparison of computational times $CT \, [\text{s}]$, parallel efficiency $E\!f\!f$ of the OpenMP-based CPU computations, and GPU speed-up $GSp_\ell$ for the generalized McWhorter--Sunada problem in 3D.
        Computations on the mesh 3D$^\triangle_5$ were not performed for 1 and 2 CPU threads due to computational time limit imposed on the system.
        Comparison of computational times $CT \, [\text{s}]$, speed-up $Sp$, and parallel efficiency $E\!f\!f$ of the OpenMP and MPI-based CPU computations for the generalized McWhorter--Sunada problem on the finest triangular mesh 2D$^\triangle_5$.
    }
    \label{tab:comptimes:mcwh3d_omp}
    \label{tab:mhfem:comptimes:CPU 2D}
    \centering
    \scalebox{0.84}{
        \input{./data/mcwhdd/comptimes_openmp_3D.tex}
    }
    \input{./data/mcwhdd/comptimes_cpu_2D.tex}
\end{table}

\begin{table}[!tb]
    \caption{
        Comparison of computational times $CT \, [\text{s}]$, parallel efficiency $E\!f\!f$ of the MPI-based CPU computations, and GPU speed-up $GSp_\ell$ for the generalized McWhorter--Sunada problem in 3D.
        Computations on the mesh 3D$^\triangle_5$ were not performed for 1 and 2 CPU cores due to computational time limit imposed on the system.
        Comparison of computational times $CT \, [\text{s}]$, speed-up $Sp$, and parallel efficiency $E\!f\!f$ of the OpenMP and MPI-based CPU computations for the generalized McWhorter--Sunada problem on the finest tetrahedral mesh 3D$^\triangle_5$.
    }
    \label{tab:comptimes:mcwh3d_mpi}
    \label{tab:mhfem:comptimes:CPU 3D}
    \centering
    \scalebox{0.85}{
        \input{./data/mcwhdd/comptimes_mpi_3D.tex}
    }
    \input{./data/mcwhdd/comptimes_cpu_3D.tex}
\end{table}

The comparison of computational times $CT$ and speed-ups $Sp_\ell$ for benchmarks involving multi-GPU computations is shown in \cref{tab:comptimes:mcwh2d_gpu,tab:comptimes:mcwh3d_gpu}.
@@ -150,46 +131,15 @@ The speed-ups could be improved by optimizing the linear system solver (BiCGstab

\begin{table}[tb]
    \caption{
        Comparison of computational times $CT \, [\text{s}]$ and speed-up $Sp_\ell$ of MPI-based GPU computations for the generalized McWhorter--Sunada problem in 2D.
        Comparison of computational times $CT \, [\text{s}]$, speed-up $Sp$, and parallel efficiency $E\!f\!f$ of MPI-based GPU computations for the generalized McWhorter--Sunada problem on the finest triangular and tetrahedral meshes.
        Each rank manages its dedicated GPU.
    }
    \label{tab:comptimes:mcwh2d_gpu}
    \label{tab:mhfem:comptimes:GPU}
    \centering
    \scalebox{0.85}{
        \input{./data/mcwhdd/comptimes_gpu_2D.tex}
    }
    \input{./data/mcwhdd/comptimes_gpu.tex}
\end{table}

\begin{table}[tb]
    \caption{
        Comparison of computational times $CT \, [\text{s}]$ and speed-up $Sp_\ell$ of MPI-based GPU computations for the generalized McWhorter--Sunada problem in 3D.
        Each rank manages its dedicated GPU.
    }
    \label{tab:comptimes:mcwh3d_gpu}
    \centering
    \scalebox{0.85}{
        \input{./data/mcwhdd/comptimes_gpu_3D.tex}
    }
\end{table}

CPU computations distributed across multiple nodes using MPI are compared in \cref{tab:comptimes:mcwh3d_cpu_nodes}.
This strong-scaling performance study was performed only on the finest tetrahedral mesh 3D$^\triangle_5$.
The computational cluster used for the computations consists of 20 dual-processor nodes containing CPUs listed in \cref{tab:hardware}, but only 16 nodes at most could be employed in one MPI computation.
The speed-up $Sp$ is calculated with respect to the computation using 12 cores, which was included in \cref{tab:comptimes:mcwh3d_mpi} already.
It can be noticed that the speed-up is often higher than the number of CPUs, which may again be attributed to the total cache size increasing while the problem size stays constant in the strong-scaling study.
Comparing the computational times on the mesh 3D$^\triangle_5$ from \cref{tab:comptimes:mcwh3d_gpu} with \cref{tab:comptimes:mcwh3d_cpu_nodes}, it can be seen that using 1 GPU leads to a shorter computational time than using 8 CPUs (4 nodes), and that at least 32 CPUs (16 nodes) are necessary to achieve a shorter computational time than with 4 GPUs.

\begin{table}[tb]
    \caption{
        Comparison of computational times $CT \, [\text{s}]$ of distributed MPI-based CPU computations for the generalized McWhorter--Sunada problem on the finest tetrahedral mesh 3D$^\triangle_5$.
        The speed-up $Sp$ was calculated with respect to the computation using 1 CPU, i.e. 12 cores.
    }
    \label{tab:comptimes:mcwh3d_cpu_nodes}
    \centering
    \scalebox{0.85}{
        \input{./data/mcwhdd/comptimes_cpu_nodes.tex}
    }
\end{table}

The final result shown in this section is the breakdown of the overall computational times from \cref{tab:comptimes:mcwh3d_cpu_nodes} on the finest tetrahedral mesh 3D$^\triangle_5$.
In \cref{tab:comptimes:portions} it can be observed that the major portion corresponds to the sparse linear system solver (BiCGstab) which involves two sparse matrix--vector multiplications, several dot products, and other BLAS-1 operations in every iteration \cite{saad:2003iterative}.
@@ -205,7 +155,7 @@ The remaining operations, such as the sparse matrix assembly and various operati
    }
    \label{tab:comptimes:portions}
    \centering
    \scalebox{0.85}{
    \scalebox{0.95}{
        \begin{tabular}{lrrrrrrr}
            \toprule
            Number of CPU cores                            & 12 & 24 & 48 & 96 & 192 & 288 & 384 \\

data/mcwhdd/README

0 → 100644
+9 −0
Original data:
- helios_gpu: research_data/MHFEM/2020.11_tests_with_distributed_meshes/2 DistSpMV with ghost ranges/helios/2_bound_to_cores_type_C/
  (only GPU results)
- rci_2D: research_data/MHFEM/2021.03.08_mcwhdd_benchmark_rci/2D_triangles/
  (only CPU results for 2D)
- rci_3D_multinode: research_data/MHFEM/2021.10.26_mcwhdd_benchmark_rci/3D_tetrahedrons/
  (only CPU results for 3D, >24 ranks)
- rci_3D: research_data/MHFEM/2022.07.28_mcwhdd_benchmark_rci_tnl_recompute/tnl-mhfem/simulation_cases/mcwhdd/
  (only CPU results for 3D, <=24 ranks)
+70 −0
@@ -8,63 +8,63 @@
%   \usepackage{stackengine}
%   \usepackage[np]{numprint}

\begin{tabular}{rN{4}{1} N{4}{1} N{1}{1} N{3}{1} N{1}{1} N{3}{1} N{1}{1}}
\begin{tabular}{rN{5}{1} N{2}{1} N{1}{2} N{5}{1} N{2}{1} N{1}{2}}
\toprule

% header row 0
  &  \multicolumn{7}{c}{GPU}
  &  \multicolumn{3}{c}{OpenMP}
  &  \multicolumn{3}{c}{MPI}
  \\
\cmidrule(l){2-8}
\cmidrule(lr){2-4}
\cmidrule(l){5-7}

% header row 1
  &  \multicolumn{1}{c}{1 rank}
  &  \multicolumn{2}{c}{2 ranks}
  &  \multicolumn{2}{c}{3 ranks}
  &  \multicolumn{2}{c}{4 ranks}
  \\
\cmidrule(lr){2-2}
\cmidrule(lr){3-4}
\cmidrule(lr){5-6}
\cmidrule(l){7-8}

% header row 2
\multicolumn{1}{c}{Id.}
  &  \multicolumn{1}{c}{$ CT $}
\multicolumn{1}{c}{Cores}
  &  \multicolumn{1}{c}{$ CT $}
  &  \multicolumn{1}{c}{$ Sp_2 $}
  &  \multicolumn{1}{c}{$ Sp $}
  &  \multicolumn{1}{c}{$ E\!f\!f $}
  &  \multicolumn{1}{c}{$ CT $}
  &  \multicolumn{1}{c}{$ Sp_3 $}
  &  \multicolumn{1}{c}{$ CT $}
  &  \multicolumn{1}{c}{$ Sp_4 $}
  &  \multicolumn{1}{c}{$ Sp $}
  &  \multicolumn{1}{c}{$ E\!f\!f $}
  \\

\midrule


        3D$^\triangle_1$
        $ \np{1} $
          &  

10743.9  &  1.0  &  1.00  &  10800.2  &  1.0  &  1.00 \\

        $ \np{2} $
          &  

6349.0  &  1.7  &  0.85  &  5693.5  &  1.9  &  0.95 \\

        $ \np{4} $
          &  

0.3  &  0.5  &  0.6  &  0.7  &  0.5  &  0.8  &  0.4 \\
3375.9  &  3.2  &  0.80  &  3143.0  &  3.4  &  0.86 \\

        3D$^\triangle_2$
        $ \np{6} $
          &  

0.6  &  0.9  &  0.7  &  1.1  &  0.5  &  1.3  &  0.5 \\
2294.6  &  4.7  &  0.78  &  2506.0  &  4.3  &  0.72 \\

        3D$^\triangle_3$
        $ \np{8} $
          &  

3.5  &  4.2  &  0.8  &  4.8  &  0.7  &  5.7  &  0.6 \\
1818.1  &  5.9  &  0.74  &  1787.6  &  6.0  &  0.76 \\

        3D$^\triangle_4$
        $ \np{12} $
          &  

54.2  &  36.7  &  1.5  &  33.4  &  1.6  &  30.6  &  1.8 \\
1296.2  &  8.3  &  0.69  &  1096.8  &  9.8  &  0.82 \\

        3D$^\triangle_5$
        $ \np{24} $
          &  

2654.8  &  1415.4  &  1.9  &  996.7  &  2.7  &  793.3  &  3.3 \\
977.0  &  11.0  &  0.46  &  549.3  &  19.7  &  0.82 \\

\bottomrule
\end{tabular}
+105 −0


% The table needs the following to be defined in the preamble:
%   
%   \usepackage{booktabs}
%   \usepackage{multirow}
%   \usepackage{adjustbox}
%   \usepackage{stackengine}
%   \usepackage[np]{numprint}

\begin{tabular}{rrrN{6}{1} N{2}{1} N{1}{2} N{6}{1} N{3}{1} N{1}{2}}
\toprule

% header row 0
  &  
  &  
  &  \multicolumn{3}{c}{OpenMP}
  &  \multicolumn{3}{c}{MPI}
  \\
\cmidrule(lr){4-6}
\cmidrule(l){7-9}

% header row 1
\multicolumn{1}{c}{Cores}
  &  \multicolumn{1}{c}{CPUs}
  &  \multicolumn{1}{c}{Nodes}
  &  \multicolumn{1}{c}{$ CT $}
  &  \multicolumn{1}{c}{$ Sp $}
  &  \multicolumn{1}{c}{$ E\!f\!f $}
  &  \multicolumn{1}{c}{$ CT $}
  &  \multicolumn{1}{c}{$ Sp $}
  &  \multicolumn{1}{c}{$ E\!f\!f $}
  \\

\midrule


$ \np{1} $  &    &    &  
188243.0  &  1.0  &  1.00  &  188706.0  &  1.0  &  1.00 \\

$ \np{2} $  &    &    &  
102074.0  &  1.8  &  0.92  &  93659.1  &  2.0  &  1.01 \\

$ \np{4} $  &    &    &  
55937.6  &  3.4  &  0.84  &  49553.0  &  3.8  &  0.95 \\

$ \np{6} $  &    &    &  
40796.4  &  4.6  &  0.77  &  35594.3  &  5.3  &  0.88 \\

$ \np{8} $  &    &    &  
32026.3  &  5.9  &  0.73  &  28958.6  &  6.5  &  0.81 \\

$ \np{12} $  &  $ \np{1} $  &  $ 1/2 $  &  
26369.7  &  7.1  &  0.59  &  23839.0  &  7.9  &  0.66 \\

$ \np{24} $  &  $ \np{2} $  &  $ \np{1} $  &  
15695.0  &  12.0  &  0.50  &  12184.2  &  15.5  &  0.65 \\

$ \np{48} $  &  $ \np{4} $  &  $ \np{2} $  &  
  &    &    &  6171.4  &  30.6  &  0.64 \\

$ \np{72} $  &  $ \np{6} $  &  $ \np{3} $  &  
  &    &    &  4026.3  &  46.9  &  0.65 \\

$ \np{96} $  &  $ \np{8} $  &  $ \np{4} $  &  
  &    &    &  3016.0  &  62.6  &  0.65 \\

$ \np{120} $  &  $ \np{10} $  &  $ \np{5} $  &  
  &    &    &  2374.4  &  79.5  &  0.66 \\

$ \np{144} $  &  $ \np{12} $  &  $ \np{6} $  &  
  &    &    &  1968.2  &  95.9  &  0.67 \\

$ \np{168} $  &  $ \np{14} $  &  $ \np{7} $  &  
  &    &    &  1643.1  &  114.8  &  0.68 \\

$ \np{192} $  &  $ \np{16} $  &  $ \np{8} $  &  
  &    &    &  1410.4  &  133.8  &  0.70 \\

$ \np{216} $  &  $ \np{18} $  &  $ \np{9} $  &  
  &    &    &  1242.5  &  151.9  &  0.70 \\

$ \np{240} $  &  $ \np{20} $  &  $ \np{10} $  &  
  &    &    &  1114.3  &  169.4  &  0.71 \\

$ \np{264} $  &  $ \np{22} $  &  $ \np{11} $  &  
  &    &    &  1003.8  &  188.0  &  0.71 \\

$ \np{288} $  &  $ \np{24} $  &  $ \np{12} $  &  
  &    &    &  924.2  &  204.2  &  0.71 \\

$ \np{312} $  &  $ \np{26} $  &  $ \np{13} $  &  
  &    &    &  860.5  &  219.3  &  0.70 \\

$ \np{336} $  &  $ \np{28} $  &  $ \np{14} $  &  
  &    &    &  807.3  &  233.8  &  0.70 \\

$ \np{360} $  &  $ \np{30} $  &  $ \np{15} $  &  
  &    &    &  761.6  &  247.8  &  0.69 \\

$ \np{384} $  &  $ \np{32} $  &  $ \np{16} $  &  
  &    &    &  702.4  &  268.7  &  0.70 \\

\bottomrule
\end{tabular}
+55 −0
@@ -8,47 +8,48 @@
%   \usepackage{stackengine}
%   \usepackage[np]{numprint}

\begin{tabular}{rrrN{5}{1} N{2}{1}}
\begin{tabular}{rN{3}{1} N{1}{1} N{1}{2} N{4}{1} N{1}{1} N{1}{2}}
\toprule

% header row 0
  &  
  &  
  &  \multicolumn{2}{c}{3D$^\triangle_5$}
  &  \multicolumn{3}{c}{2D$^\triangle_5$}
  &  \multicolumn{3}{c}{3D$^\triangle_5$}
  \\
\cmidrule(l){4-5}
\cmidrule(lr){2-4}
\cmidrule(l){5-7}

% header row 1
\multicolumn{1}{c}{Cores}
  &  \multicolumn{1}{c}{CPUs}
  &  \multicolumn{1}{c}{Nodes}
\multicolumn{1}{c}{GPUs}
  &  \multicolumn{1}{c}{$ CT $}
  &  \multicolumn{1}{c}{$ Sp $}
  &  \multicolumn{1}{c}{$ E\!f\!f $}
  &  \multicolumn{1}{c}{$ CT $}
  &  \multicolumn{1}{c}{$ Sp $}
  &  \multicolumn{1}{c}{$ E\!f\!f $}
  \\

\midrule


$ \np{12} $  &  $ \np{1} $  &  $ 1/2 $  &  
23949.5  &  1.0 \\
        $ \np{1} $
          &  

$ \np{24} $  &  $ \np{2} $  &  $ \np{1} $  &  
12255.5  &  2.0 \\
528.6  &  1.0  &  1.00  &  2654.8  &  1.0  &  1.00 \\

$ \np{48} $  &  $ \np{4} $  &  $ \np{2} $  &  
6171.4  &  3.9 \\
        $ \np{2} $
          &  

566.1  &  0.9  &  0.47  &  1415.4  &  1.9  &  0.94 \\

$ \np{96} $  &  $ \np{8} $  &  $ \np{4} $  &  
3016.0  &  7.9 \\
        $ \np{3} $
          &  

$ \np{192} $  &  $ \np{16} $  &  $ \np{8} $  &  
1410.4  &  17.0 \\
642.5  &  0.8  &  0.27  &  996.7  &  2.7  &  0.89 \\

$ \np{288} $  &  $ \np{24} $  &  $ \np{12} $  &  
924.2  &  25.9 \\
        $ \np{4} $
          &  

$ \np{384} $  &  $ \np{32} $  &  $ \np{16} $  &  
702.4  &  34.1 \\
709.7  &  0.7  &  0.19  &  793.3  &  3.3  &  0.84 \\

\bottomrule
\end{tabular}