CUDA reduction uses the same buffer as input and output array
In TNL/Containers/Algorithms/Reduction_impl.h:139, when calling CudaReductionKernelLauncher, we use the same buffer, deviceAux1, as both the input and the output buffer for the reduction. If the CUDA block with index 0 is not the first one to finish its work, its data can be overwritten by other CUDA blocks. We want to avoid allocating two buffers. A possible solution is to increase the size of deviceAux1 and split it into two buffers. We need to check the performance when fixing this!
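
The split-buffer idea could look roughly like the sketch below. It is a sequential, host-side C++ model only, not the TNL code: reducePass and reduceWithSplitBuffer are hypothetical names, a plain loop stands in for the kernel launch, and summation of int stands in for the generic reduction operation. It shows one enlarged allocation being divided into two halves that alternate as input and output, so no pass reads and writes the same memory.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for one reduction kernel launch: each "block" sums
// up to blockSize consecutive input elements into one output element.
// Returns the number of partial results written to output.
std::size_t reducePass(const int* input, std::size_t size,
                       std::size_t blockSize, int* output)
{
    assert(blockSize > 1);  // otherwise the number of elements never shrinks
    const std::size_t blocks = (size + blockSize - 1) / blockSize;
    for (std::size_t b = 0; b < blocks; ++b) {
        const std::size_t begin = b * blockSize;
        const std::size_t end = std::min(begin + blockSize, size);
        int sum = 0;
        for (std::size_t i = begin; i < end; ++i)
            sum += input[i];
        output[b] = sum;
    }
    return blocks;
}

// Hypothetical driver: a single auxiliary allocation (the role deviceAux1
// plays in the issue) is split into two halves used in a ping-pong fashion.
int reduceWithSplitBuffer(const std::vector<int>& data, std::size_t blockSize)
{
    if (data.empty())
        return 0;
    // Each half must hold the partial results of the first pass. The second
    // half could be smaller; equal halves keep the sketch simple.
    const std::size_t half = (data.size() + blockSize - 1) / blockSize;
    std::vector<int> aux(2 * half);  // one allocation instead of two
    int* in = aux.data();            // first half
    int* out = aux.data() + half;    // second half

    // First pass: user data -> first half.
    std::size_t size = reducePass(data.data(), data.size(), blockSize, in);
    // Further passes: read one half, write the other, then swap roles.
    while (size > 1) {
        size = reducePass(in, size, blockSize, out);
        std::swap(in, out);
    }
    return in[0];
}
```

On the GPU the two halves would be device pointers into the enlarged deviceAux1, and the cost of the larger allocation and the pointer swap is what the performance check mentioned above would need to measure.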