Refactor SpMV kernels to use CudaBlockReduceShfl::warpReduce

Several SpMV kernels, e.g. EllpackCudaReductionKernelFull, contain their own inlined code for parallel reduction across a warp. They should call CudaBlockReduceShfl::warpReduce instead, so that the reduction logic lives in one place.
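
A minimal sketch of what the refactoring could look like. The exact signature of CudaBlockReduceShfl::warpReduce is not quoted here, so the helper `warpReduceSketch`, the functor `Plus`, the toy kernel `rowSumKernel` and the host wrapper `runRowSum` are all hypothetical stand-ins; only `__shfl_xor_sync` and `warpSize` are actual CUDA built-ins. The idea is that a kernel computes its per-thread partial product and delegates the shuffle loop to a single shared helper rather than repeating it inline.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for CudaBlockReduceShfl::warpReduce.
// The real function's name, template parameters and arguments may differ.
template< typename Real, typename Reduction >
__device__ Real warpReduceSketch( Real value, Reduction reduce )
{
   // Butterfly reduction: exchange with the lane at XOR distance 16, 8, 4, 2, 1.
   // Assumes all 32 lanes of the warp are active (full mask).
   for( int offset = warpSize / 2; offset > 0; offset /= 2 )
      value = reduce( value, __shfl_xor_sync( 0xffffffff, value, offset ) );
   return value;  // every lane now holds the reduced value
}

// Hypothetical reduction functor (plain addition).
struct Plus
{
   __device__ float operator()( float a, float b ) const { return a + b; }
};

// Toy kernel: one warp multiplies 32 matrix entries of a single row by the
// matching vector entries and sums the products, as an SpMV kernel would.
__global__ void rowSumKernel( const float* values, const float* x, float* result )
{
   float partial = values[ threadIdx.x ] * x[ threadIdx.x ];
   // The formerly inlined shuffle loop is replaced by one call.
   partial = warpReduceSketch( partial, Plus{} );
   if( threadIdx.x == 0 )
      *result = partial;
}

// Host wrapper: fills a row with 1..32, uses x = (1, ..., 1), returns the row sum.
float runRowSum()
{
   const int n = 32;
   float hValues[ n ], hX[ n ], hResult = 0.0f;
   for( int i = 0; i < n; i++ ) {
      hValues[ i ] = float( i + 1 );
      hX[ i ] = 1.0f;
   }
   float *dValues, *dX, *dResult;
   cudaMalloc( &dValues, n * sizeof( float ) );
   cudaMalloc( &dX, n * sizeof( float ) );
   cudaMalloc( &dResult, sizeof( float ) );
   cudaMemcpy( dValues, hValues, n * sizeof( float ), cudaMemcpyHostToDevice );
   cudaMemcpy( dX, hX, n * sizeof( float ), cudaMemcpyHostToDevice );
   rowSumKernel<<< 1, n >>>( dValues, dX, dResult );
   cudaMemcpy( &hResult, dResult, sizeof( float ), cudaMemcpyDeviceToHost );
   cudaFree( dValues );
   cudaFree( dX );
   cudaFree( dResult );
   return hResult;
}

int main()
{
   printf( "row sum = %f\n", runRowSum() );  // 1 + 2 + ... + 32 = 528
   return 0;
}
```

Kernels that only need the result in lane 0 could equally use a `__shfl_down_sync` loop; the XOR variant is shown because it leaves the result in every lane, which is an assumption about what the callers need.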