In the second weak scaling analysis, the primary requirement is to keep the physical size of the domain constant.
In this study, we fix $L_x = L_y = L_z =\SI{0.25}{\metre}$, the same as in the strong scaling study.
The global discrete problem size is scaled in all three spatial dimensions such that the lattice size $N_x \times N_y \times N_z$ is (approximately) proportional to the number of ranks.
However, this problem is not due to the communication cost.
The CUDA thread block size selected by \cref{alg:LBM:CUDA thread block size} is $(1, B_y, 1)$ for all lattice sizes used in this study, where $B_y =256$ for single precision and $B_y =128$ for double precision.
Note that the performance of the LBM algorithm decreases when $N_y$ is not a multiple of $B_y$, because the last thread block along $y$ contains inactive threads, compared to the optimal case where $N_y$ is a multiple of $B_y$.
\begin{table}[tb]
\caption{
Weak scaling with 3D domain expansion in single and double precision on the Karolina supercomputer.
The lattice size is scaled as $N_x = N_y = N_z =\round*{512\cbrt{N_{\mathrm{ranks}}}}$ in single precision and $N_x = N_y = N_z =\round*{256\cbrt{N_{\mathrm{ranks}}}}$ in double precision.
Note that all results presented in this section were computed without utilizing the Nvidia GPUDirect technology \cite{nvidia:GPUDirect} which is still not fully functional on the Karolina supercomputer.
Hence, the effective bandwidth of inter-node communication was limited, because data could not be transferred from the GPU memory directly to the InfiniBand network interface and had to be buffered in the operating system memory instead.
Another problem is that each computation used InfiniBand in a multi-rail configuration \cite{UCX:FAQ} with two ports, but these ports were shared by all ranks on a node while the other two InfiniBand ports remained completely unused.