In this study, we fix $L_x = L_y = L_z = \SI{0.25}{\metre}$, same as in the strong scaling study.
The global discrete problem size is scaled in all three spatial dimensions such that the lattice size $N_x \times N_y \times N_z$ is (approximately) proportional to the number of ranks.
The lattice size is scaled according to the formula
\begin{equation*}
N_x = N_y = N_z = 32\,\round*{s \cbrt{N_{\mathrm{ranks}}}},
\end{equation*}
where $\round{\cdot}$ denotes rounding to the nearest integer and $s$ is a scaling parameter.
The factor 32 ensures that the resulting dimensions are multiples of the warp size in order to avoid partially inactive warps in the execution of CUDA kernels.
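As a concrete illustration, the scaling rule can be sketched in a few lines (a minimal sketch of ours; the helper name is hypothetical and Python is used only for brevity, the solver itself is implemented in CUDA/C++):

```python
def lattice_size(n_ranks: int, s: int) -> int:
    """Edge length N_x = N_y = N_z = 32 * round(s * cbrt(n_ranks)).

    The factor 32 keeps every dimension a multiple of the warp size (32).
    """
    return 32 * round(s * n_ranks ** (1 / 3))

# s = 16 yields the single-precision base lattice 512^3 for one rank;
# perfect cubes of ranks (1, 8, 64) give edges 512, 1024, 2048, while
# intermediate rank counts produce edges such as 640 or 1632.
for n in (1, 2, 8, 32, 64):
    print(n, lattice_size(n, 16))
```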
For the results presented in \cref{tab:lbm:weak scaling 3D}, we chose $s = 16$ for single precision in order to obtain the largest possible base lattice ($512\times512\times512$ for one rank), for which the best weak scaling efficiency was achieved.
For double precision, we had to use a smaller base lattice ($256\times256\times256$ for one rank) with $s = 8$ due to memory limitations.
The reason is that in this study the communication size per rank increases with the number of ranks (it is proportional to $N_y \times N_z$) and storage for the data received from neighboring ranks must be allocated, so with $s = 16$ and $N_{\mathrm{ranks}} \ge 32$ the amount of memory needed per rank would exceed what the GPU accelerators have available.
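A rough back-of-the-envelope sketch illustrates this budget. The per-cell byte count below is our assumption for illustration only (27 populations stored in two copies), not the solver's exact footprint, and a 1D slab decomposition along $x$ is assumed to match the stated $N_y \times N_z$ communication size:

```python
def per_rank_memory_gib(n_ranks: int, s: int,
                        bytes_per_cell: int = 27 * 2 * 4) -> float:
    """Crude per-rank GPU memory estimate for a slab decomposition along x.

    Assumes N_x = N_y = N_z = 32 * round(s * cbrt(n_ranks)); each rank
    holds N_x / n_ranks slabs plus one ghost layer per side for the data
    received from neighboring ranks (proportional to N_y * N_z).
    bytes_per_cell is a placeholder: 27 populations, 2 copies, 4 B each.
    """
    n = 32 * round(s * n_ranks ** (1 / 3))
    slabs = n / n_ranks + 2  # local slabs plus two ghost layers
    return slabs * n * n * bytes_per_cell / 2**30
```

Doubling \texttt{bytes\_per\_cell} for double precision roughly doubles the estimate, which indicates why the base lattice had to shrink from $s = 16$ to $s = 8$.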
It can be noticed in \cref{tab:lbm:weak scaling 3D} that $E\!f\!f$ does not behave monotonically: for 1, 8, and 64 ranks it is close to 1, but otherwise it is significantly lower.
However, this problem is not due to the communication cost (the communication is completely overlapped with computation); rather, the computation itself is slower than it should be.
The CUDA thread block size selected by \cref{alg:LBM:CUDA thread block size} is $(1, B_y, 1)$ for all lattice sizes used in this study, where $B_y =256$ for single precision and $B_y =128$ for double precision.
It can be noticed that the performance of the LBM algorithm decreases for lattice sizes where $N_y$ is not a multiple of $B_y$: the last thread block along the $y$-axis then contains inactive threads, unlike the optimal case where $B_y$ divides $N_y$.
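The effect can be quantified with a short sketch of ours: the kernel launches $\lceil N_y / B_y \rceil$ thread blocks along $y$, so a fraction of the launched threads is inactive whenever $B_y$ does not divide $N_y$.

```python
import math

def inactive_fraction(n_y: int, b_y: int) -> float:
    """Fraction of threads launched along y that map to no lattice site."""
    launched = math.ceil(n_y / b_y) * b_y
    return (launched - n_y) / launched

# Single precision (B_y = 256): the edges 512, 1024, 2048 obtained for
# 1, 8, and 64 ranks divide evenly, while edges such as 640 or 1632
# (intermediate rank counts) leave a noticeable share of threads idle.
for n_y in (512, 640, 1024, 1632, 2048):
    print(n_y, f"{inactive_fraction(n_y, 256):.1%}")
```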
\begin{table}[tb]
\caption{
Weak scaling with 3D domain expansion in single and double precision on the Karolina supercomputer.
The lattice size is scaled as $N_x = N_y = N_z = 32\,\round*{16\cbrt{N_{\mathrm{ranks}}}}$ in single precision and $N_x = N_y = N_z = 32\,\round*{8\cbrt{N_{\mathrm{ranks}}}}$ in double precision.