Commit 7a49e478 authored and committed by Tomáš Oberhuber

Optimized CUDA reduction by decreasing desired grid size.

parent 7a51ba8a
1 merge request: !32 Expression templates 2
@@ -195,11 +195,16 @@ struct CudaReductionKernelLauncher
       // where blocksPerMultiprocessor is determined according to the number of
       // available registers on the multiprocessor.
       // On Tesla K40c, desGridSize = 8 * 15 = 120.
+      //
+      // Update:
+      // It seems to be better to map only one CUDA block per one multiprocessor or maybe
+      // just slightly more. Therefore we omit blocksdPerMultiprocessor in the following.
       CudaReductionKernelLauncher( const Index size )
       : activeDevice( Devices::CudaDeviceInfo::getActiveDevice() ),
         blocksdPerMultiprocessor( Devices::CudaDeviceInfo::getRegistersPerMultiprocessor( activeDevice )
                                   / ( Reduction_maxThreadsPerBlock * Reduction_registersPerThread ) ),
-        desGridSize( blocksdPerMultiprocessor * Devices::CudaDeviceInfo::getCudaMultiprocessors( activeDevice ) ),
+        //desGridSize( blocksdPerMultiprocessor * Devices::CudaDeviceInfo::getCudaMultiprocessors( activeDevice ) ),
+        desGridSize( Devices::CudaDeviceInfo::getCudaMultiprocessors( activeDevice ) ),
         originalSize( size )
       {
       }
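
The arithmetic behind the change can be sketched in plain C++ as follows. This is an illustration only, not code from the repository: `DeviceInfo`, `desGridSizeBefore` and `desGridSizeAfter` are hypothetical names, and the values 256 threads per block and 32 registers per thread are assumptions chosen so that the result matches the "8 * 15 = 120" figure quoted in the comment for the Tesla K40c (65536 registers per multiprocessor, 15 multiprocessors).

```cpp
// Hypothetical stand-in for the values the launcher queries at run time
// through Devices::CudaDeviceInfo.
struct DeviceInfo
{
   int registersPerMultiprocessor;
   int multiprocessors;
};

// Assumed kernel parameters (not taken from the diff); they reproduce the
// 8 blocks per multiprocessor quoted in the comment for the Tesla K40c.
constexpr int maxThreadsPerBlock = 256;
constexpr int registersPerThread = 32;

// Number of blocks that fit on one multiprocessor, limited by registers only.
constexpr int blocksPerMultiprocessor( DeviceInfo d )
{
   return d.registersPerMultiprocessor / ( maxThreadsPerBlock * registersPerThread );
}

// Desired grid size before the commit: fill every multiprocessor with as many
// blocks as the register file allows.
constexpr int desGridSizeBefore( DeviceInfo d )
{
   return blocksPerMultiprocessor( d ) * d.multiprocessors;
}

// Desired grid size after the commit: one block per multiprocessor.
constexpr int desGridSizeAfter( DeviceInfo d )
{
   return d.multiprocessors;
}

// Tesla K40c: 65536 32-bit registers per multiprocessor, 15 multiprocessors.
constexpr DeviceInfo teslaK40c{ 65536, 15 };

static_assert( desGridSizeBefore( teslaK40c ) == 120, "matches the comment: 8 * 15" );
static_assert( desGridSizeAfter( teslaK40c ) == 15, "one block per multiprocessor" );
```

Under these assumptions the commit shrinks the desired grid on a Tesla K40c from 120 blocks to 15. Note that `blocksdPerMultiprocessor` is still computed in the constructor's initializer list; it is simply no longer used for `desGridSize`.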
......