CUDA reduction uses the same buffer as input and output array

In TNL/Containers/Algorithms/Reduction_impl.h:139, when calling CudaReductionKernelLauncher, we use the same buffer deviceAux1 as both the input and the output of the reduction. If the CUDA block with index 0 is not the first one to finish its work, input data it has not yet read can be overwritten by the results of other CUDA blocks. We want to avoid allocating two separate buffers. A possible solution is to increase the size of deviceAux1 and split it into two parts, one used as input and the other as output. We need to check performance when fixing this!
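
A minimal host-side sketch of the proposed fix, in plain C++ rather than CUDA so that it can be compiled and checked anywhere. It is not the TNL code: `reducePass` and `reduceWithSplitBuffer` are hypothetical names, and the sequential loop only stands in for the kernel launch, so it cannot reproduce the race itself. It only illustrates making one allocation, splitting it into two halves, and swapping the halves between passes so that no pass reads and writes the same memory.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for one kernel launch: every group of `blockSize`
// input elements (one CUDA block on the GPU) writes one partial sum to `out`.
// Returns the number of partial results written.
std::size_t reducePass(const int* in, std::size_t size, int* out, std::size_t blockSize)
{
    const std::size_t blocks = (size + blockSize - 1) / blockSize;
    for (std::size_t b = 0; b < blocks; ++b) {
        int sum = 0;
        for (std::size_t i = b * blockSize; i < size && i < (b + 1) * blockSize; ++i)
            sum += in[i];
        out[b] = sum;
    }
    return blocks;
}

// Multi-pass reduction using a single auxiliary allocation split in two.
int reduceWithSplitBuffer(const std::vector<int>& input, std::size_t blockSize)
{
    assert(blockSize >= 2);  // otherwise the passes would never shrink the data
    if (input.empty())
        return 0;

    // The first pass produces at most `half` partial results, so each half
    // of the auxiliary buffer needs exactly that many elements.
    const std::size_t half = (input.size() + blockSize - 1) / blockSize;
    std::vector<int> aux(2 * half);   // one allocation (assumed analogue of deviceAux1)
    int* in = aux.data();             // first half
    int* out = aux.data() + half;     // second half

    // First pass reads the user data and writes into the first half.
    std::size_t size = reducePass(input.data(), input.size(), in, blockSize);

    // Later passes read one half and write the other, then swap roles.
    while (size > 1) {
        size = reducePass(in, size, out, blockSize);
        std::swap(in, out);
    }
    return in[0];
}
```

With this layout the auxiliary buffer is twice as large as before, but there is still only one allocation. Whether the extra memory or the pointer swapping affects performance on the GPU would still have to be measured.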