Found a way to avoid using volatile in CUDA reduction: __syncwarp()
The performance seems to be identical to the code using volatile.
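The kernel below is a minimal sketch of the technique, not the code from this commit. It assumes a block size of 256 (a power of two, at least 64), and all names are hypothetical: `warp_reduce`, `reduce_sum_kernel`, `reduce_sum`, `kBlockSize`. The classic unrolled last-warp step relies on `volatile` shared memory and implicit warp lockstep. Here each step reads its partner into a register, calls `__syncwarp()`, writes, and calls `__syncwarp()` again. `__syncwarp()` (CUDA 9+) orders shared-memory accesses among the participating lanes, so the compiler cannot reuse stale cached values. The read/write split also removes the race that independent thread scheduling (Volta and later) exposes.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kBlockSize = 256;  // assumed launch block size: power of two, >= 64

// Final-warp reduction without `volatile`. Each step reads the partner value into
// a register, synchronizes the warp, then writes, and synchronizes again.
// __syncwarp() orders shared-memory accesses among participating lanes, so every
// read sees the previous step's writes and no lane overwrites a value another lane
// still has to read.
__device__ void warp_reduce(float* sdata, unsigned tid) {
    float v = sdata[tid];
    for (unsigned offset = 32; offset > 0; offset >>= 1) {
        v += sdata[tid + offset];
        __syncwarp();  // all reads of this step complete before any write
        sdata[tid] = v;
        __syncwarp();  // all writes are visible before the next step's reads
    }
}

// One partial sum per block. Requires blockDim.x == kBlockSize.
__global__ void reduce_sum_kernel(const float* in, float* out, int n) {
    __shared__ float sdata[kBlockSize];
    unsigned tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Block-wide tree reduction until 64 partial sums remain in sdata[0..63].
    for (unsigned s = blockDim.x / 2; s > 32; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    // Warp 0 is fully active here, so the default full mask of __syncwarp() is valid.
    if (tid < 32) warp_reduce(sdata, tid);
    if (tid == 0) out[blockIdx.x] = sdata[0];
}

// Hypothetical host wrapper: sums per-block partials on the CPU.
// CUDA error checking is omitted for brevity.
float reduce_sum(const float* h_in, int n) {
    if (n <= 0) return 0.0f;
    int blocks = (n + kBlockSize - 1) / kBlockSize;
    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, blocks * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    reduce_sum_kernel<<<blocks, kBlockSize>>>(d_in, d_out, n);
    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), d_out, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    float total = 0.0f;
    for (float p : partial) total += p;
    return total;
}
```

For example, `reduce_sum(h.data(), 1000)` on a vector of 1000 ones should return `1000.0f`. In the `volatile` version, `warp_reduce` would take a `volatile float*` and update `sdata[tid]` in place with no warp barrier. That pattern is only safe under the lockstep execution that pre-Volta GPUs happened to provide.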
Parent: b74a24d2