Reduction and multireduction refactoring
Brief summary:
- rewritten multireduction using lambda functions
- avoided `volatile` using `__syncwarp()`
- using reduction functions with `return a + b` instead of `a += b`
- using `std::plus`, `std::multiplies`, `std::logical_and`, `std::logical_or`, etc. instead of custom lambda functions
- optimized OpenMP thread counts for reduction and multireduction
- added computation of sample standard deviation to benchmarks
- implemented parallel prefix-sum with OpenMP
- implemented distributed prefix-sum
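
To illustrate the functional style of the reduction operation mentioned above (`return a + b` instead of `a += b`, with standard functors instead of custom lambdas), here is a minimal sequential sketch. The names `reduce`, `sum`, `product` and `all_nonzero` are hypothetical and chosen only for this example; they are not the interface introduced by this merge request.

```cpp
#include <functional>
#include <vector>

// Hypothetical helper for illustration only, not the library's interface.
// `reduction` is a pure binary function: it returns the combined value
// (`return a + b`) rather than modifying its first argument (`a += b`).
template <typename T, typename Reduction>
T reduce(const std::vector<T>& data, Reduction&& reduction, T identity)
{
    T result = identity;
    for (const T& x : data)
        result = reduction(result, x);
    return result;
}

// Standard functors (C++14 transparent versions) replace custom lambdas
// such as `[](int a, int b) { return a + b; }`.
inline int sum(const std::vector<int>& v)
{
    return reduce(v, std::plus<>{}, 0);
}

inline int product(const std::vector<int>& v)
{
    return reduce(v, std::multiplies<>{}, 1);
}

// Logical AND over all elements; the identity element is "true" (1).
inline bool all_nonzero(const std::vector<int>& v)
{
    return reduce(v, std::logical_and<>{}, 1) != 0;
}
```

For the input `{1, 2, 3, 4}`, `sum` gives 10, `product` gives 24 and `all_nonzero` gives `true`. Because the functor never mutates its arguments, the same object can be passed unchanged to a sequential loop, an OpenMP loop or a device kernel.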
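
The parallel prefix-sum item can be sketched with the common blocked, multi-phase scheme below. This is an assumption about the general technique, not a description of the code in this merge request; the function name `inclusive_prefix_sum` and the `num_blocks` parameter are hypothetical. The distributed variant would plausibly follow the same pattern, with each block owned by a different process and the block totals exchanged between them.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical sketch of a blocked inclusive prefix-sum.
// Compiled without OpenMP support, the pragmas are ignored and the
// code runs sequentially with the same result.
template <typename T>
void inclusive_prefix_sum(std::vector<T>& data, std::size_t num_blocks = 4)
{
    const std::size_t n = data.size();
    if (n == 0 || num_blocks == 0)
        return;
    num_blocks = std::min(num_blocks, n);
    const std::size_t block_size = (n + num_blocks - 1) / num_blocks;
    // Recompute the block count so that no block is empty.
    num_blocks = (n + block_size - 1) / block_size;
    const long blocks = static_cast<long>(num_blocks);

    std::vector<T> block_sums(num_blocks);

    // Phase 1: scan each block independently and record its total.
    #pragma omp parallel for
    for (long b = 0; b < blocks; ++b) {
        const std::size_t begin = static_cast<std::size_t>(b) * block_size;
        const std::size_t end = std::min(begin + block_size, n);
        for (std::size_t i = begin + 1; i < end; ++i)
            data[i] = data[i - 1] + data[i];
        block_sums[b] = data[end - 1];
    }

    // Phase 2: sequential exclusive scan of the block totals,
    // turning them into per-block offsets.
    T offset = T();
    for (long b = 0; b < blocks; ++b) {
        const T total = block_sums[b];
        block_sums[b] = offset;
        offset = offset + total;
    }

    // Phase 3: shift every block except the first by its offset.
    #pragma omp parallel for
    for (long b = 1; b < blocks; ++b) {
        const std::size_t begin = static_cast<std::size_t>(b) * block_size;
        const std::size_t end = std::min(begin + block_size, n);
        for (std::size_t i = begin; i < end; ++i)
            data[i] = block_sums[b] + data[i];
    }
}
```

Only the second phase is sequential, and its cost is proportional to the number of blocks rather than the number of elements.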
Edited by Jakub Klinkovský