CUDA reduction does not work with result types whose alignment is not 8, 16, 32 or 64 bits

The implementation relies on extern __shared__ variables, which are very restricted (see the comment in SharedMemory.h). There are specializations only for 8, 16, 32 and 64 bit types, and it is not possible to generalize them to arbitrary types. As a result, CUDA reduction does not work for types such as StaticVector< 5, double > or for general structs whose size may not even be a power of 2.

It would be much easier to use statically sized arrays for the shared memory (i.e. without extern). This is possible because the reduction kernels are only launched with a constant block size.
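A minimal sketch of the statically sized approach, not the actual implementation: the names Vec5, blockSumKernel and reduceSum are hypothetical, and Vec5 merely stands in for a 40-byte type such as StaticVector< 5, double >. It assumes that the block size is a power of 2 passed as a template parameter, and that T has no non-trivial constructor (CUDA does not support dynamic initialization of __shared__ variables). Error checking of the CUDA calls is omitted for brevity.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical example type: 5 doubles = 40 bytes, not a power of 2.
struct Vec5
{
   double data[ 5 ];

   __host__ __device__ Vec5& operator+=( const Vec5& other )
   {
      for( int i = 0; i < 5; i++ )
         data[ i ] += other.data[ i ];
      return *this;
   }
};

// The block size is a compile-time constant, so the shared array
// can be declared with a static size and an arbitrary element type.
template< typename T, int blockSize >
__global__ void blockSumKernel( const T* input, T* output, int size, T identity )
{
   static_assert( ( blockSize & ( blockSize - 1 ) ) == 0,
                  "this sketch assumes a power-of-2 block size" );

   // Note: no extern, hence no per-size specializations are needed.
   __shared__ T sdata[ blockSize ];

   const int tid = threadIdx.x;
   const int gid = blockIdx.x * blockSize + tid;

   // Threads past the end of the input contribute the identity element.
   if( gid < size )
      sdata[ tid ] = input[ gid ];
   else
      sdata[ tid ] = identity;
   __syncthreads();

   // Standard tree reduction within the block.
   for( int s = blockSize / 2; s > 0; s /= 2 ) {
      if( tid < s )
         sdata[ tid ] += sdata[ tid + s ];
      __syncthreads();
   }

   // Thread 0 writes the partial result of this block.
   if( tid == 0 )
      output[ blockIdx.x ] = sdata[ 0 ];
}

// Hypothetical host helper: launches the kernel with a constant block size
// and sums the per-block partial results on the CPU.
template< typename T >
T reduceSum( const T* hostInput, int size, T identity )
{
   if( size <= 0 )
      return identity;

   constexpr int blockSize = 256;
   const int blocks = ( size + blockSize - 1 ) / blockSize;

   T* dInput = nullptr;
   T* dOutput = nullptr;
   cudaMalloc( &dInput, size * sizeof( T ) );
   cudaMalloc( &dOutput, blocks * sizeof( T ) );
   cudaMemcpy( dInput, hostInput, size * sizeof( T ), cudaMemcpyHostToDevice );

   blockSumKernel< T, blockSize ><<< blocks, blockSize >>>( dInput, dOutput, size, identity );

   std::vector< T > partial( blocks );
   cudaMemcpy( partial.data(), dOutput, blocks * sizeof( T ), cudaMemcpyDeviceToHost );
   cudaFree( dInput );
   cudaFree( dOutput );

   T result = identity;
   for( const T& p : partial )
      result += p;
   return result;
}
```

With blockSize = 256 and a 40-byte element, the static array takes 10240 bytes of shared memory per block. Because the array is a distinct variable in each template instantiation, the same kernel template can be instantiated for several result types in one translation unit.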