Found a way to avoid using volatile in CUDA reduction: __syncwarp()
The performance seems to be identical to the code using volatile.
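For reference, here is a minimal sketch of what the `__syncwarp()` variant of the final-warp step might look like. It is not the original code: the names `warp_reduce`, `reduce_sum`, `sum_on_device` and the block size `kBlockSize` are hypothetical, and it assumes CUDA 9 or later and a power-of-two block size of at least 64. Each step reads into a register, synchronizes the warp, writes back, and synchronizes again, so the shared-memory pointer does not need to be `volatile`.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Assumed block size (hypothetical): a power of two, at least 64.
constexpr int kBlockSize = 256;

// Final-warp reduction without volatile. Must be called by all 32 threads
// of warp 0, since __syncwarp() defaults to the full-warp mask.
__device__ void warp_reduce(float* sdata, unsigned tid) {
    float v = sdata[tid];
    for (unsigned offset = 32; offset > 0; offset >>= 1) {
        v += sdata[tid + offset];  // read the partner's partial sum
        __syncwarp();              // all reads finish before any write
        sdata[tid] = v;            // publish the updated partial sum
        __syncwarp();              // all writes finish before the next read
    }
}

__global__ void reduce_sum(const float* in, float* out, int n) {
    __shared__ float sdata[kBlockSize];
    unsigned tid = threadIdx.x;
    unsigned i = blockIdx.x * blockDim.x + tid;

    // Load one element per thread, padding with zeros past the end.
    sdata[tid] = (i < static_cast<unsigned>(n)) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction across warps; stops once 64 partial sums remain.
    for (unsigned s = blockDim.x / 2; s > 32; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // The last 64 partial sums are reduced by the first warp alone.
    if (tid < 32) warp_reduce(sdata, tid);

    // Thread 0 holds the block total; accumulate it into the global result.
    if (tid == 0) atomicAdd(out, sdata[0]);
}

// Host helper (hypothetical): sums a vector on the device.
// Error checking is omitted to keep the sketch short.
float sum_on_device(const std::vector<float>& h) {
    int n = static_cast<int>(h.size());
    if (n == 0) return 0.0f;

    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(float));

    int blocks = (n + kBlockSize - 1) / kBlockSize;
    reduce_sum<<<blocks, kBlockSize>>>(d_in, d_out, n);

    float result = 0.0f;
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

Calling `sum_on_device` on a vector of ones should return the element count. The two `__syncwarp()` calls per step are what replace `volatile`: the first separates the reads from the writes, the second makes the writes visible before the next step's reads.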