CUDA reduction uses the same buffer as input and output array
In TNL/Containers/Algorithms/Reduction_impl.h:139
when calling CudaReductionKernelLauncher
we use the same buffer deviceAux1
as both the input and the output buffer for the reduction. If the CUDA block with index 0 is not the first one to finish its work, its input data can be overwritten by the results of other CUDA blocks. We want to avoid allocating two separate buffers. A solution might be to increase the size of deviceAux1
and split it into two buffers. We need to check performance when fixing this!
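
A rough host-side sketch of the proposed fix, in plain C++ rather than the actual CUDA kernel. The names reducePass, reduceSplit and reduceInPlace are hypothetical and are not TNL code. The loop over "blocks" runs in reverse order only to mimic the case where block 0 finishes last, and a std::vector stands in for the enlarged deviceAux1. It assumes blockSize >= 2.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// One reduction pass: "block" b sums its chunk of `in` and stores the partial
// sum in out[b]. Blocks are visited in reverse to mimic block 0 finishing last.
inline std::size_t reducePass(const int* in, int* out,
                              std::size_t n, std::size_t blockSize)
{
    const std::size_t blocks = (n + blockSize - 1) / blockSize;
    for (std::size_t b = blocks; b-- > 0;) {
        int sum = 0;
        const std::size_t end = std::min(n, (b + 1) * blockSize);
        for (std::size_t i = b * blockSize; i < end; ++i)
            sum += in[i];
        out[b] = sum;
    }
    return blocks;
}

// Proposed scheme (assumption): a single allocation, twice as large,
// split into two halves that swap input/output roles after every pass.
inline int reduceSplit(const std::vector<int>& data, std::size_t blockSize)
{
    if (data.empty())
        return 0;
    const std::size_t half = (data.size() + blockSize - 1) / blockSize;
    std::vector<int> aux(2 * half);  // stands in for the enlarged deviceAux1
    int* src = aux.data();           // first half
    int* dst = aux.data() + half;    // second half
    std::size_t n = reducePass(data.data(), src, data.size(), blockSize);
    while (n > 1) {
        n = reducePass(src, dst, n, blockSize);
        std::swap(src, dst);         // the output of this pass feeds the next
    }
    return src[0];
}

// Current scheme, for comparison: later passes read and write the same buffer.
inline int reduceInPlace(const std::vector<int>& data, std::size_t blockSize)
{
    if (data.empty())
        return 0;
    std::vector<int> aux((data.size() + blockSize - 1) / blockSize);
    std::size_t n = reducePass(data.data(), aux.data(), data.size(), blockSize);
    while (n > 1)
        n = reducePass(aux.data(), aux.data(), n, blockSize);
    return aux[0];
}
```

For 16 ones and blockSize 2, reduceSplit returns 16, while reduceInPlace returns a different value under this block order, because block 0 reads partial sums that later blocks have already overwritten. The split version still needs only one allocation; the extra cost is the doubled buffer size and a pointer swap per pass, which is what the performance check should measure.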