In this study, we fix $L_x = L_y = L_z = \SI{0.25}{\metre}$, same as in the strong scaling study.
The global discrete problem size is scaled in all three spatial dimensions such that the lattice size $N_x \times N_y \times N_z$ is (approximately) proportional to the number of ranks.
The lattice size is scaled according to the formula
\begin{equation*}
N_x = N_y = N_z = 32\,\round*{s \cbrt{N_{\mathrm{ranks}}}},
\end{equation*}
where $\round{\cdot}$ denotes rounding to the nearest integer and $s$ is a scaling parameter.
The factor 32 ensures that the resulting dimensions are multiples of the warp size in order to avoid partially inactive warps in the execution of CUDA kernels.
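As a concrete illustration, the scaling rule can be sketched in a few lines (a minimal sketch of ours; the helper name is hypothetical and Python is used only for brevity, the solver itself is implemented in CUDA/C++):

```python
def lattice_size(n_ranks: int, s: int) -> int:
    """Edge length N_x = N_y = N_z = 32 * round(s * cbrt(n_ranks)).

    The factor 32 keeps every dimension a multiple of the warp size (32).
    """
    return 32 * round(s * n_ranks ** (1 / 3))

# s = 16 yields the single-precision base lattice 512^3 for one rank;
# perfect cubes of ranks (1, 8, 64) give edges 512, 1024, 2048, while
# intermediate rank counts produce edges such as 640 or 1632.
for n in (1, 2, 8, 32, 64):
    print(n, lattice_size(n, 16))
```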
For the results presented in \cref{tab:lbm:weak scaling 3D}, we chose $s = 16$ for single precision in order to obtain the largest possible base lattice ($512\times512\times512$ for one rank), for which the best weak scaling efficiency was achieved.
For double precision, we had to use a smaller base lattice ($256\times256\times256$ for one rank) with $s = 8$ due to memory limitations.
The reason is that in this study the communication size per rank increases with the number of ranks (it is proportional to $N_y \times N_z$) and storage for the data received from neighboring ranks must be allocated, so with $s = 16$ and $N_{\mathrm{ranks}} \ge 32$ the amount of memory needed per rank would exceed what the GPU accelerators have available.
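A rough back-of-the-envelope sketch illustrates this budget. The per-cell byte count below is our assumption for illustration only (27 populations stored in two copies), not the solver's exact footprint, and a 1D slab decomposition along $x$ is assumed to match the stated $N_y \times N_z$ communication size:

```python
def per_rank_memory_gib(n_ranks: int, s: int,
                        bytes_per_cell: int = 27 * 2 * 4) -> float:
    """Crude per-rank GPU memory estimate for a slab decomposition along x.

    Assumes N_x = N_y = N_z = 32 * round(s * cbrt(n_ranks)); each rank
    holds N_x / n_ranks slabs plus one ghost layer per side for the data
    received from neighboring ranks (proportional to N_y * N_z).
    bytes_per_cell is a placeholder: 27 populations, 2 copies, 4 B each.
    """
    n = 32 * round(s * n_ranks ** (1 / 3))
    slabs = n / n_ranks + 2  # local slabs plus two ghost layers
    return slabs * n * n * bytes_per_cell / 2**30
```

Doubling \texttt{bytes\_per\_cell} for double precision roughly doubles the estimate, which indicates why the base lattice had to shrink from $s = 16$ to $s = 8$.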
It can be noticed in \cref{tab:lbm:weak scaling 3D} that $E\!f\!f$ does not behave monotonically: for 1, 8, and 64 ranks it is close to 1, but otherwise it is significantly lower.
However, this problem is not due to the communication cost (the communication is completely overlapped with computation); rather, the computation itself is slower than it should be.
The CUDA thread block size selected by \cref{alg:LBM:CUDA thread block size} is $(1, B_y, 1)$ for all lattice sizes used in this study, where $B_y =256$ for single precision and $B_y =128$ for double precision.
It can be noticed that the performance of the LBM algorithm decreases for lattice sizes where $N_y$ is not a multiple of $B_y$: the last thread block along the $y$-axis then contains inactive threads, unlike the optimal case where $B_y$ divides $N_y$.
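The effect can be quantified with a short sketch of ours: the kernel launches $\lceil N_y / B_y \rceil$ thread blocks along $y$, so a fraction of the launched threads is inactive whenever $B_y$ does not divide $N_y$.

```python
import math

def inactive_fraction(n_y: int, b_y: int) -> float:
    """Fraction of threads launched along y that map to no lattice site."""
    launched = math.ceil(n_y / b_y) * b_y
    return (launched - n_y) / launched

# Single precision (B_y = 256): the edges 512, 1024, 2048 obtained for
# 1, 8, and 64 ranks divide evenly, while edges such as 640 or 1632
# (intermediate rank counts) leave a noticeable share of threads idle.
for n_y in (512, 640, 1024, 1632, 2048):
    print(n_y, f"{inactive_fraction(n_y, 256):.1%}")
```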
\begin{table}[tb]
\caption{
Weak scaling with 3D domain expansion in single and double precision on the Karolina supercomputer.
The lattice size is scaled as $N_x = N_y = N_z = 32\,\round*{16\cbrt{N_{\mathrm{ranks}}}}$ in single precision and $N_x = N_y = N_z = 32\,\round*{8\cbrt{N_{\mathrm{ranks}}}}$ in double precision.