CUDA reduction does not work with result type whose alignment is not 8, 16, 32 or 64 bits
The implementation relies on extern __shared__ variables, which are very restricted (see the comment in SharedMemory.h). There are specializations only for 8-, 16-, 32- and 64-bit types, and it is not possible to make this general for any type, so CUDA reduction does not work for types such as StaticVector< 5, double > or general structs whose size may not even be a power of 2.
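For illustration, the per-size specialization pattern described above looks roughly like the following. SharedMemory.h is not reproduced in this issue, so this is only a sketch: the names SharedMemorySketch, smem32_sketch and smem64_sketch are hypothetical, and only two of the four sizes are shown.

```cuda
#include <climits>
#include <cstddef>

// Primary template is declared but never defined, so any type whose size
// has no specialization below is an incomplete type and fails to compile.
template< typename T, std::size_t SizeBits = CHAR_BIT * sizeof( T ) >
struct SharedMemorySketch;

// An extern __shared__ array cannot be declared with a different element
// type in each template instantiation, so each supported size gets its own
// fixed-type array that is then reinterpreted as T.
template< typename T >
struct SharedMemorySketch< T, 32 >
{
   __device__ T* get()
   {
      extern __shared__ unsigned int smem32_sketch[];
      return reinterpret_cast< T* >( smem32_sketch );
   }
};

template< typename T >
struct SharedMemorySketch< T, 64 >
{
   __device__ T* get()
   {
      extern __shared__ unsigned long long smem64_sketch[];
      return reinterpret_cast< T* >( smem64_sketch );
   }
};
```

With this pattern, SharedMemorySketch< float > and SharedMemorySketch< double > are usable, while a 320-bit type such as a struct of five doubles matches no specialization.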
It would be much easier to use statically sized arrays for the shared memory (i.e. without extern). This is possible because we only launch reduction kernels with a constant block size.
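A minimal sketch of the proposed approach, not the actual TNL kernel: the block size is a template parameter, so the shared array can be declared with a static size for an arbitrary result type. Vec5, add, blockReduceSketch and reduceOnDevice are hypothetical names introduced for this example; the sketch assumes a power-of-two block size, a trivially default-constructible T, n > 0, and omits CUDA error checking.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical result type: 40 bytes, not a power of two.
struct Vec5
{
   double v[ 5 ];
};

__host__ __device__ inline Vec5 add( const Vec5& a, const Vec5& b )
{
   Vec5 r;
   for( int i = 0; i < 5; i++ )
      r.v[ i ] = a.v[ i ] + b.v[ i ];
   return r;
}

// blockSize is a compile-time constant (assumed to be a power of two).
template< int blockSize, typename T >
__global__ void blockReduceSketch( const T* input, T* output, int n, T identity )
{
   // Statically sized shared array: no extern, no per-size specialization.
   // Assumes T is trivially default-constructible.
   __shared__ T sdata[ blockSize ];

   const int tid = threadIdx.x;
   const int gid = blockIdx.x * blockSize + tid;
   sdata[ tid ] = ( gid < n ) ? input[ gid ] : identity;
   __syncthreads();

   // Tree reduction within the block.
   for( int s = blockSize / 2; s > 0; s /= 2 ) {
      if( tid < s )
         sdata[ tid ] = add( sdata[ tid ], sdata[ tid + s ] );
      __syncthreads();
   }

   // Thread 0 writes the partial result of this block.
   if( tid == 0 )
      output[ blockIdx.x ] = sdata[ 0 ];
}

// Host helper: reduces n > 0 elements, finishing the per-block results on the host.
inline Vec5 reduceOnDevice( const Vec5* hostData, int n )
{
   constexpr int blockSize = 256;
   const int blocks = ( n + blockSize - 1 ) / blockSize;

   Vec5* dIn = nullptr;
   Vec5* dOut = nullptr;
   cudaMalloc( &dIn, n * sizeof( Vec5 ) );
   cudaMalloc( &dOut, blocks * sizeof( Vec5 ) );
   cudaMemcpy( dIn, hostData, n * sizeof( Vec5 ), cudaMemcpyHostToDevice );

   // No dynamic shared memory size is passed in the launch configuration.
   blockReduceSketch< blockSize ><<< blocks, blockSize >>>( dIn, dOut, n, Vec5{} );

   std::vector< Vec5 > partial( blocks );
   cudaMemcpy( partial.data(), dOut, blocks * sizeof( Vec5 ), cudaMemcpyDeviceToHost );
   cudaFree( dIn );
   cudaFree( dOut );

   Vec5 result{};
   for( const Vec5& p : partial )
      result = add( result, p );
   return result;
}
```

Here each block uses 256 * sizeof( Vec5 ) = 10240 bytes of static shared memory, which is below the 48 KB static limit per block. Larger result types would need a smaller block size to stay within that limit.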