Added tests of the reduction and scan algorithms with CustomScalar
This way we test both the general CUDA implementation using shared memory and the specialization using __shfl instructions. Both the reduction and scan kernels needed some tweaks due to shared memory usage with non-fundamental types.
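For illustration only, here is a minimal sketch of the kind of shared memory issue mentioned above. It is not the code from this commit. The `CustomScalar` struct below is a hypothetical stand-in for the test type, and `blockSum` and `rawShared` are made-up names. It assumes that the compiler rejects a `__shared__` array of a type with a non-empty constructor, so the kernel declares raw bytes and reinterprets them as `T`.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for the CustomScalar test type: a wrapper with a
// user-provided constructor, so it is not a fundamental type.
struct CustomScalar
{
    double value;
    __host__ __device__ CustomScalar() : value(0.0) {}
    __host__ __device__ CustomScalar(double v) : value(v) {}
    __host__ __device__ CustomScalar operator+(const CustomScalar& other) const
    {
        return CustomScalar(value + other.value);
    }
};

// Sums up to blockDim.x elements per block (blockDim.x must be a power of two).
template <typename T>
__global__ void blockSum(const T* in, T* out, int n)
{
    // Assumption: `__shared__ T sdata[N];` fails to compile for such a T,
    // so untyped dynamic shared memory is reinterpreted as T instead.
    extern __shared__ __align__(16) unsigned char rawShared[];
    T* sdata = reinterpret_cast<T*>(rawShared);

    const int tid = threadIdx.x;
    const int i = blockIdx.x * blockDim.x + tid;
    sdata[tid] = (i < n) ? in[i] : T(0);  // pad with the neutral element
    __syncthreads();

    // Tree reduction in shared memory.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] = sdata[tid] + sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = sdata[0];
}

int main()
{
    const int n = 256;
    std::vector<CustomScalar> host(n, CustomScalar(1.0));

    CustomScalar* dIn = nullptr;
    CustomScalar* dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(CustomScalar));
    cudaMalloc(&dOut, sizeof(CustomScalar));
    cudaMemcpy(dIn, host.data(), n * sizeof(CustomScalar), cudaMemcpyHostToDevice);

    // One block; the third launch parameter is the dynamic shared memory size.
    blockSum<CustomScalar><<<1, n, n * sizeof(CustomScalar)>>>(dIn, dOut, n);

    CustomScalar result;
    cudaMemcpy(&result, dOut, sizeof(CustomScalar), cudaMemcpyDeviceToHost);
    std::printf("sum = %g\n", result.value);

    cudaFree(dIn);
    cudaFree(dOut);
    return 0;
}
```

The sketch covers only the general shared memory path. A `__shfl`-based specialization would exchange values through registers and would not need this workaround.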
parent
2d454b15