Refactor SpMV kernels using CudaBlockReduceShfl::warpReduce
Various SpMV kernels (e.g. EllpackCudaReductionKernelFull) contain "inlined" code for the parallel reduction across a warp. They should call CudaBlockReduceShfl::warpReduce instead.
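The following is a minimal sketch of the intended refactoring: the per-kernel shuffle loop moves into one device helper that the kernel calls once. All names here are hypothetical. `warpReduceSum` stands in for `CudaBlockReduceShfl::warpReduce`, whose real signature may differ (for example, it may take a reduction functor and an identity value). The kernel assumes a row-major Ellpack layout with `entriesPerRow` slots per row, with unused slots marked by column index -1. It is not the actual EllpackCudaReductionKernelFull.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for CudaBlockReduceShfl::warpReduce (assumption: the
// real helper may be generic over the reduction operation).
// Requires all 32 lanes of the warp to be active.
template <typename Real>
__device__ Real warpReduceSum(Real value)
{
   // Each step adds the value held `offset` lanes higher; after
   // log2(32) = 5 steps lane 0 holds the sum over the whole warp.
   for (int offset = warpSize / 2; offset > 0; offset /= 2)
      value += __shfl_down_sync(0xffffffffu, value, offset);
   return value;
}

// Hypothetical kernel: one warp per row of a row-major Ellpack matrix.
template <typename Real>
__global__ void ellpackWarpPerRowKernel(int rows, int entriesPerRow,
                                        const int* columns, const Real* values,
                                        const Real* x, Real* y)
{
   const int row = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
   const int lane = threadIdx.x % warpSize;
   // blockDim.x is a multiple of warpSize, so whole warps exit together and
   // the full-mask shuffle in warpReduceSum stays valid.
   if (row >= rows) return;

   // Each lane accumulates a strided subset of the row's slots.
   Real partial = 0;
   for (int k = lane; k < entriesPerRow; k += warpSize) {
      const int col = columns[row * entriesPerRow + k];
      if (col >= 0) partial += values[row * entriesPerRow + k] * x[col];
   }
   // The formerly inlined shuffle loop becomes a single helper call.
   partial = warpReduceSum(partial);
   if (lane == 0) y[row] = partial;
}

static void check(cudaError_t err)
{
   if (err != cudaSuccess) {
      std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
      std::abort();
   }
}

// Host wrapper: copies the matrix and x to the device, launches the kernel,
// and returns y = A * x.
std::vector<double> ellpackSpmv(int rows, int entriesPerRow,
                                const std::vector<int>& columns,
                                const std::vector<double>& values,
                                const std::vector<double>& x)
{
   assert(columns.size() == static_cast<size_t>(rows) * entriesPerRow);
   assert(values.size() == columns.size());
   const size_t n = columns.size();
   int* dCols = nullptr;
   double *dVals = nullptr, *dX = nullptr, *dY = nullptr;
   check(cudaMalloc(&dCols, n * sizeof(int)));
   check(cudaMalloc(&dVals, n * sizeof(double)));
   check(cudaMalloc(&dX, x.size() * sizeof(double)));
   check(cudaMalloc(&dY, rows * sizeof(double)));
   check(cudaMemcpy(dCols, columns.data(), n * sizeof(int), cudaMemcpyHostToDevice));
   check(cudaMemcpy(dVals, values.data(), n * sizeof(double), cudaMemcpyHostToDevice));
   check(cudaMemcpy(dX, x.data(), x.size() * sizeof(double), cudaMemcpyHostToDevice));

   const int threadsPerBlock = 128;  // must be a multiple of the warp size (32)
   const int warpsPerBlock = threadsPerBlock / 32;
   const int blocks = (rows + warpsPerBlock - 1) / warpsPerBlock;
   ellpackWarpPerRowKernel<<<blocks, threadsPerBlock>>>(rows, entriesPerRow,
                                                        dCols, dVals, dX, dY);
   check(cudaGetLastError());

   std::vector<double> y(rows);
   check(cudaMemcpy(y.data(), dY, rows * sizeof(double), cudaMemcpyDeviceToHost));
   cudaFree(dCols); cudaFree(dVals); cudaFree(dX); cudaFree(dY);
   return y;
}
```

With the reduction factored out, every warp-per-row kernel (Ellpack, CSR, and so on) shares one reduction path. A fix or optimization to the shuffle sequence then only has to be made in one place.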