Template Numerical Library version develop:6514e4815
Flexible (parallel) reduction and prefix-sum tutorial

Introduction

This tutorial introduces flexible parallel reduction in TNL. It shows how to easily implement parallel reduction with user defined operations which may run on both CPU and GPU. Parallel reduction is a programming pattern appering very often in different kind of algorithms for example in scalar product, vector norms or mean value evaluation but also in sequences or strings comparison.

Flexible parallel reduction

We will explain the flexible parallel reduction on several examples. We start with the simplest sum of sequence of numbers followed by more advanced problems like scalar product or vector norms.

Sum

We start with simple problem of computing sum of sequence of numbers

\[ s = \sum_{i=1}^n a_i. \]

Sequentialy, such sum can be computed very easily as follows:

1double sequentialSum( const double* a, const int size )
2{
3 double sum( 0.0 );
4 for( int i = 0; i < size; i++ )
5 sum += a[ i ];
6 return sum;
7}

Doing the same in CUDA for GPU is, however, much more difficult (see. Optimizing Parallel Reduction in CUDA). The final code has tens of lines and it is something you do not want to write again and again anytime you need to sum a series of numbers. Using TNL and C++ lambda functions we may do the same on few lines of code efficiently and independently on the hardware beneath. Let us first rewrite the previous example using the C++ lambda functions:

1double sequentialSum( const double* a, const int size )
2{
3 auto fetch = [=] (int i)->double { return a[ i ]; };
4 auto reduction = [] (double& x, const double& y) { return x + y; };
5
6 double sum( 0.0 );
7 for( int i = 0; i < size; i++ )
8 sum = reduction( sum, fetch( i ) );
9 return sum;
10}
11

As can be seen, we split the reduction into two steps:

  1. fetch reads the input data. Thanks to this lambda you can:
    1. Connect the reduction algorithm with given input arrays or vectors (or any other data structure).
    2. Perform operation you need to do with the input data.
    3. Perform another secondary operation simoultanously with the parallel reduction.
  2. reduction is operation we want to do after the data fetch. Usually it is summation, multiplication, evaluation of minimum or maximum or some logical operation.

Putting everything together gives the following example:

1#include <iostream>
2#include <cstdlib>
3#include <TNL/Containers/Vector.h>
4#include <TNL/Algorithms/reduce.h>
5
6using namespace TNL;
7using namespace TNL::Containers;
8using namespace TNL::Algorithms;
9
10template< typename Device >
11double sum( const Vector< double, Device >& v )
12{
13 /****
14 * Get vector view which can be captured by lambda.
15 */
16 auto view = v.getConstView();
17
18 /****
19 * The fetch function just reads elements of vector v.
20 */
21 auto fetch = [=] __cuda_callable__ ( int i ) -> double { return view[ i ]; };
22
23 /***
24 * Reduction is sum of two numbers.
25 */
26 auto reduction = [] __cuda_callable__ ( const double& a, const double& b ) { return a + b; };
27
28 /***
29 * Finally we call the templated function Reduction and pass number of elements to reduce,
30 * lambdas defined above and finally value of idempotent element, zero in this case, which serve for the
31 * reduction initiation.
32 */
33 return reduce< Device >( 0, view.getSize(), fetch, reduction, 0.0 );
34}
35
36int main( int argc, char* argv[] )
37{
38 /***
39 * Firstly, test the sum with vectors allocated on CPU.
40 */
42 host_v = 1.0;
43 std::cout << "host_v = " << host_v << std::endl;
44 std::cout << "The sum of the host vector elements is " << sum( host_v ) << "." << std::endl;
45
46 /***
47 * And then also on GPU.
48 */
49#ifdef HAVE_CUDA
51 cuda_v = 1.0;
52 std::cout << "cuda_v = " << cuda_v << std::endl;
53 std::cout << "The sum of the CUDA vector elements is " << sum( cuda_v ) << "." << std::endl;
54#endif
55 return EXIT_SUCCESS;
56}
57
Vector extends Array with algebraic operations.
Definition: Vector.h:45
ConstViewType getConstView(IndexType begin=0, IndexType end=0) const
Returns a non-modifiable view of the vector.
Definition: Vector.hpp:62
T endl(T... args)
Namespace for fundamental TNL algorithms.
Definition: AtomicOperations.h:23
Namespace for TNL containers.
Definition: Array.h:25
The main TNL namespace.
Definition: AtomicOperations.h:22

Since TNL vectors cannot be pass to CUDA kernels and so they cannot be captured by CUDA lambdas, we must first get vector view from the vector using a method getConstView().

Note tha we pass 0.0 as the last argument of the template function reduce< Device >. It is an idempotent element (see Idempotence). It is an element which, for given operation, does not change the result. For addition, it is zero. The result looks as follows.

host_v = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
The sum of the host vector elements is 10.
cuda_v = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
The sum of the CUDA vector elements is 10.

Sum of vector elements can be also obtained as sum(v).

Product

To demonstrate the effect of the idempotent element, we will now compute product of all elements of the vector. The idempotent element is one for multiplication and we also need to replace a+b with a*b in the definition of reduction. We get the following code:

1#include <iostream>
2#include <cstdlib>
3#include <TNL/Containers/Vector.h>
4#include <TNL/Algorithms/reduce.h>
5
6using namespace TNL;
7using namespace TNL::Containers;
8using namespace TNL::Algorithms;
9
10template< typename Device >
11double product( const Vector< double, Device >& v )
12{
13 auto view = v.getConstView();
14 auto fetch = [=] __cuda_callable__ ( int i ) { return view[ i ]; };
15 auto reduction = [] __cuda_callable__ ( const double& a, const double& b ) { return a * b; };
16
17 /***
18 * Since we compute the product of all elements, the reduction must be initialized by 1.0 not by 0.0.
19 */
20 return reduce< Device >( 0, view.getSize(), fetch, reduction, 1.0 );
21}
22
23int main( int argc, char* argv[] )
24{
25 /***
26 * The first test on CPU ...
27 */
29 host_v = 1.0;
30 std::cout << "host_v = " << host_v << std::endl;
31 std::cout << "The product of the host vector elements is " << product( host_v ) << "." << std::endl;
32
33 /***
34 * ... the second test on GPU.
35 */
36#ifdef HAVE_CUDA
38 cuda_v = 1.0;
39 std::cout << "cuda_v = " << cuda_v << std::endl;
40 std::cout << "The product of the CUDA vector elements is " << product( cuda_v ) << "." << std::endl;
41#endif
42 return EXIT_SUCCESS;
43}
44

leading to output like this:

host_v = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
The product of the host vector elements is 1.
cuda_v = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
The product of the CUDA vector elements is 1.

Product of vector elements can be computed using fuction product(v).

Scalar product

One of the most important operation in the linear algebra is the scalar product of two vectors. Compared to coputing the sum of vector elements we must change the function fetch to read elements from both vectors and multiply them. See the following example.

1#include <iostream>
2#include <cstdlib>
3#include <TNL/Containers/Vector.h>
4#include <TNL/Algorithms/reduce.h>
5
6using namespace TNL;
7using namespace TNL::Containers;
8using namespace TNL::Algorithms;
9
10template< typename Device >
11double scalarProduct( const Vector< double, Device >& u, const Vector< double, Device >& v )
12{
13 auto u_view = u.getConstView();
14 auto v_view = v.getConstView();
15
16 /***
17 * Fetch computes product of corresponding elements of both vectors.
18 */
19 auto fetch = [=] __cuda_callable__ ( int i ) { return u_view[ i ] * v_view[ i ]; };
20 auto reduction = [] __cuda_callable__ ( const double& a, const double& b ) { return a + b; };
21 return reduce< Device >( 0, v_view.getSize(), fetch, reduction, 0.0 );
22}
23
24int main( int argc, char* argv[] )
25{
26 /***
27 * The first test on CPU ...
28 */
29 Vector< double, Devices::Host > host_u( 10 ), host_v( 10 );
30 host_u = 1.0;
31 host_v.forAllElements( [] __cuda_callable__ ( int i, double& value ) { value = 2 * ( i % 2 ) - 1; } );
32 std::cout << "host_u = " << host_u << std::endl;
33 std::cout << "host_v = " << host_v << std::endl;
34 std::cout << "The scalar product ( host_u, host_v ) is " << scalarProduct( host_u, host_v ) << "." << std::endl;
35 std::cout << "The scalar product ( host_v, host_v ) is " << scalarProduct( host_v, host_v ) << "." << std::endl;
36
37 /***
38 * ... the second test on GPU.
39 */
40#ifdef HAVE_CUDA
41 Vector< double, Devices::Cuda > cuda_u( 10 ), cuda_v( 10 );
42 cuda_u = 1.0;
43 cuda_v.forAllElements( [] __cuda_callable__ ( int i, double& value ) { value = 2 * ( i % 2 ) - 1; } );
44 std::cout << "cuda_u = " << cuda_u << std::endl;
45 std::cout << "cuda_v = " << cuda_v << std::endl;
46 std::cout << "The scalar product ( cuda_u, cuda_v ) is " << scalarProduct( cuda_u, cuda_v ) << "." << std::endl;
47 std::cout << "The scalar product ( cuda_v, cuda_v ) is " << scalarProduct( cuda_v, cuda_v ) << "." << std::endl;
48#endif
49 return EXIT_SUCCESS;
50}
51

The result is:

host_u = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
host_v = [ -1, 1, -1, 1, -1, 1, -1, 1, -1, 1 ]
The scalar product ( host_u, host_v ) is 0.
The scalar product ( host_v, host_v ) is 10.
cuda_u = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
cuda_v = [ -1, 1, -1, 1, -1, 1, -1, 1, -1, 1 ]
The scalar product ( cuda_u, cuda_v ) is 0.
The scalar product ( cuda_v, cuda_v ) is 10.

Scalar product of vectors u and v in TNL can be computed by TNL::dot(u, v) or simply as (u, v).

Maximum norm

Maximum norm of a vector equals modulus of the vector largest element. Therefore, fetch must return the absolute value of the vector elements and reduction wil return maximum of given values. Look at the following example.

1#include <iostream>
2#include <cstdlib>
3#include <TNL/Containers/Vector.h>
4#include <TNL/Algorithms/reduce.h>
5
6using namespace TNL;
7using namespace TNL::Containers;
8using namespace TNL::Algorithms;
9
10template< typename Device >
11double maximumNorm( const Vector< double, Device >& v )
12{
13 auto view = v.getConstView();
14 auto fetch = [=] __cuda_callable__ ( int i ) { return abs( view[ i ] ); };
15 auto reduction = [] __cuda_callable__ ( const double& a, const double& b ) { return max( a, b ); };
16 return reduce< Device >( 0, view.getSize(), fetch, reduction, 0.0 );
17}
18
19int main( int argc, char* argv[] )
20{
22 host_v.forAllElements( [] __cuda_callable__ ( int i, double& value ) { value = i - 7; } );
23 std::cout << "host_v = " << host_v << std::endl;
24 std::cout << "The maximum norm of the host vector elements is " << maximumNorm( host_v ) << "." << std::endl;
25#ifdef HAVE_CUDA
27 cuda_v.forAllElements( [] __cuda_callable__ ( int i, double& value ) { value = i - 7; } );
28 std::cout << "cuda_v = " << cuda_v << std::endl;
29 std::cout << "The maximum norm of the CUDA vector elements is " << maximumNorm( cuda_v ) << "." << std::endl;
30#endif
31 return EXIT_SUCCESS;
32}
33
T max(T... args)

The output is:

host_v = [ -7, -6, -5, -4, -3, -2, -1, 0, 1, 2 ]
The maximum norm of the host vector elements is 7.
cuda_v = [ -7, -6, -5, -4, -3, -2, -1, 0, 1, 2 ]
The maximum norm of the CUDA vector elements is 7.

Maximum norm in TNL is computed by the function TNL::maxNorm.

Vectors comparison

Comparison of two vectors involve (parallel) reduction as well. The fetch part is responsible for comparison of corresponding vector elements result of which is boolean true or false for each vector elements. The reduction part must perform logical and operation on all of them. We must not forget to change the idempotent element to true. The code may look as follows:

1#include <iostream>
2#include <cstdlib>
3#include <TNL/Containers/Vector.h>
4#include <TNL/Algorithms/reduce.h>
5
6using namespace TNL;
7using namespace TNL::Containers;
8using namespace TNL::Algorithms;
9
10template< typename Device >
11bool comparison( const Vector< double, Device >& u, const Vector< double, Device >& v )
12{
13 auto u_view = u.getConstView();
14 auto v_view = v.getConstView();
15
16 /***
17 * Fetch compares corresponding elements of both vectors
18 */
19 auto fetch = [=] __cuda_callable__ ( int i )->bool { return ( u_view[ i ] == v_view[ i ] ); };
20
21 /***
22 * Reduce performs logical AND on intermediate results obtained by fetch.
23 */
24 auto reduction = [] __cuda_callable__ ( const bool& a, const bool& b ) { return a && b; };
25 return reduce< Device >( 0, v_view.getSize(), fetch, reduction, true );
26}
27
28int main( int argc, char* argv[] )
29{
30 Vector< double, Devices::Host > host_u( 10 ), host_v( 10 );
31 host_u = 1.0;
32 host_v.forAllElements( [] __cuda_callable__ ( int i, double& value ) { value = 2 * ( i % 2 ) - 1; } );
33 std::cout << "host_u = " << host_u << std::endl;
34 std::cout << "host_v = " << host_v << std::endl;
35 std::cout << "Comparison of host_u and host_v is: " << ( comparison( host_u, host_v ) ? "'true'" : "'false'" ) << "." << std::endl;
36 std::cout << "Comparison of host_u and host_u is: " << ( comparison( host_u, host_u ) ? "'true'" : "'false'" ) << "." << std::endl;
37#ifdef HAVE_CUDA
38 Vector< double, Devices::Cuda > cuda_u( 10 ), cuda_v( 10 );
39 cuda_u = 1.0;
40 cuda_v.forAllElements( [] __cuda_callable__ ( int i, double& value ) { value = 2 * ( i % 2 ) - 1; } );
41 std::cout << "cuda_u = " << cuda_u << std::endl;
42 std::cout << "cuda_v = " << cuda_v << std::endl;
43 std::cout << "Comparison of cuda_u and cuda_v is: " << ( comparison( cuda_u, cuda_v ) ? "'true'" : "'false'" ) << "." << std::endl;
44 std::cout << "Comparison of cuda_u and cuda_u is: " << ( comparison( cuda_u, cuda_u ) ? "'true'" : "'false'" ) << "." << std::endl;
45#endif
46 return EXIT_SUCCESS;
47}
48

And the output looks as:

host_u = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
host_v = [ -1, 1, -1, 1, -1, 1, -1, 1, -1, 1 ]
Comparison of host_u and host_v is: 'false'.
Comparison of host_u and host_u is: 'true'.
cuda_u = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
cuda_v = [ -1, 1, -1, 1, -1, 1, -1, 1, -1, 1 ]
Comparison of cuda_u and cuda_v is: 'false'.
Comparison of cuda_u and cuda_u is: 'true'.

Update and residue

In iterative solvers we often need to update a vector and compute the update norm at the same time. For example the Euler method is defined as

\[ \bf u^{k+1} = \bf u^k + \tau \Delta \bf u. \]

Together with the vector addition, we may want to compute also \(L_2\)-norm of \(\Delta \bf u\) which may indicate convergence. Computing first the addition and then the norm would be inefficient because we would have to fetch the vector \(\Delta \bf u\) twice from the memory. The following example shows how to do the addition and norm computation at the same time.

1#include <iostream>
2#include <cstdlib>
3#include <TNL/Containers/Vector.h>
4#include <TNL/Algorithms/reduce.h>
5
6using namespace TNL;
7using namespace TNL::Containers;
8using namespace TNL::Algorithms;
9
10template< typename Device >
11double updateAndResidue( Vector< double, Device >& u, const Vector< double, Device >& delta_u, const double& tau )
12{
13 auto u_view = u.getView();
14 auto delta_u_view = delta_u.getConstView();
15 auto fetch = [=] __cuda_callable__ ( int i ) mutable ->double {
16 const double& add = delta_u_view[ i ];
17 u_view[ i ] += tau * add;
18 return add * add; };
19 auto reduction = [] __cuda_callable__ ( const double& a, const double& b ) { return a + b; };
20 return sqrt( reduce< Device >( 0, u_view.getSize(), fetch, reduction, 0.0 ) );
21}
22
23int main( int argc, char* argv[] )
24{
25 const double tau = 0.1;
26 Vector< double, Devices::Host > host_u( 10 ), host_delta_u( 10 );
27 host_u = 0.0;
28 host_delta_u = 1.0;
29 std::cout << "host_u = " << host_u << std::endl;
30 std::cout << "host_delta_u = " << host_delta_u << std::endl;
31 double residue = updateAndResidue( host_u, host_delta_u, tau );
32 std::cout << "New host_u is: " << host_u << "." << std::endl;
33 std::cout << "Residue is:" << residue << std::endl;
34#ifdef HAVE_CUDA
35 Vector< double, Devices::Cuda > cuda_u( 10 ), cuda_delta_u( 10 );
36 cuda_u = 0.0;
37 cuda_delta_u = 1.0;
38 std::cout << "cuda_u = " << cuda_u << std::endl;
39 std::cout << "cuda_delta_u = " << cuda_delta_u << std::endl;
40 residue = updateAndResidue( cuda_u, cuda_delta_u, tau );
41 std::cout << "New cuda_u is: " << cuda_u << "." << std::endl;
42 std::cout << "Residue is:" << residue << std::endl;
43#endif
44 return EXIT_SUCCESS;
45}
46
ViewType getView(IndexType begin=0, IndexType end=0)
Returns a modifiable view of the vector.
Definition: Vector.hpp:49
T sqrt(T... args)

The result reads as:

host_u = [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
host_delta_u = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
New host_u is: [ 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1 ].
Residue is:3.16228
cuda_u = [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
cuda_delta_u = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
New cuda_u is: [ 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1 ].
Residue is:3.16228

Simple MapReduce

We can also filter the data to be reduced. This operation is called MapReduce . You simply add necessary if statement to the fetch function, or in the case of the following example we use a statement

return u_view[ i ] > 0.0 ? u_view[ i ] : 0.0;

to sum up only the positive numbers in the vector.

1#include <iostream>
2#include <cstdlib>
3#include <TNL/Containers/Vector.h>
4#include <TNL/Algorithms/reduce.h>
5
6using namespace TNL;
7using namespace TNL::Containers;
8using namespace TNL::Algorithms;
9
10template< typename Device >
11double mapReduce( Vector< double, Device >& u )
12{
13 auto u_view = u.getView();
14 auto fetch = [=] __cuda_callable__ ( int i )->double {
15 return u_view[ i ] > 0 ? u_view[ i ] : 0.0; };
16 auto reduction = [] __cuda_callable__ ( const double& a, const double& b ) { return a + b; };
17 return reduce< Device >( 0, u_view.getSize(), fetch, reduction, 0.0 );
18}
19
20int main( int argc, char* argv[] )
21{
23 host_u.forAllElements( [] __cuda_callable__ ( int i, double& value ) { value = sin( ( double ) i ); } );
24 double result = mapReduce( host_u );
25 std::cout << "host_u = " << host_u << std::endl;
26 std::cout << "Sum of the positive numbers is:" << result << std::endl;
27#ifdef HAVE_CUDA
29 cuda_u = host_u;
30 result = mapReduce( cuda_u );
31 std::cout << "cuda_u = " << cuda_u << std::endl;
32 std::cout << "Sum of the positive numbers is:" << result << std::endl;
33#endif
34 return EXIT_SUCCESS;
35}
36
T sin(T... args)

The result is:

host_u = [ 0, 0.841471, 0.909297, 0.14112, -0.756802, -0.958924, -0.279415, 0.656987, 0.989358, 0.412118 ]
Sum of the positive numbers is:3.95035
cuda_u = [ 0, 0.841471, 0.909297, 0.14112, -0.756802, -0.958924, -0.279415, 0.656987, 0.989358, 0.412118 ]
Sum of the positive numbers is:3.95035

Take a look at the following example where the filtering depends on the element indexes rather than values:

1#include <iostream>
2#include <cstdlib>
3#include <TNL/Containers/Vector.h>
4#include <TNL/Algorithms/reduce.h>
5#include <TNL/Timer.h>
6
7using namespace TNL;
8using namespace TNL::Containers;
9using namespace TNL::Algorithms;
10
11template< typename Device >
12double mapReduce( Vector< double, Device >& u )
13{
14 auto u_view = u.getView();
15 auto fetch = [=] __cuda_callable__ ( int i )->double {
16 if( i % 2 == 0 ) return u_view[ i ];
17 return 0.0; };
18 auto reduction = [] __cuda_callable__ ( const double& a, const double& b ) { return a + b; };
19 return reduce< Device >( 0, u_view.getSize(), fetch, reduction, 0.0 );
20}
21
22int main( int argc, char* argv[] )
23{
24 Timer timer;
25 Vector< double, Devices::Host > host_u( 100000 );
26 host_u = 1.0;
27 timer.start();
28 double result = mapReduce( host_u );
29 timer.stop();
30 std::cout << "Host tesult is:" << result << ". It took " << timer.getRealTime() << " seconds." << std::endl;
31#ifdef HAVE_CUDA
32 Vector< double, Devices::Cuda > cuda_u( 100000 );
33 cuda_u = 1.0;
34 timer.reset();
35 timer.start();
36 result = mapReduce( cuda_u );
37 timer.stop();
38 std::cout << "CUDA result is:" << result << ". It took " << timer.getRealTime() << " seconds." << std::endl;
39#endif
40 return EXIT_SUCCESS;
41}
42
Class for real time, CPU time and CPU cycles measuring.
Definition: Timer.h:32
void reset()
Resets timer.
Definition: Timer.hpp:29
double getRealTime() const
Returns the elapsed real time on this timer.
Definition: Timer.hpp:60
void stop()
Stops (pauses) the timer.
Definition: Timer.hpp:40
void start()
Starts timer.
Definition: Timer.hpp:52

The result is:

Host tesult is:50000. It took 0.0181678 seconds.
CUDA result is:50000. It took 0.000426658 seconds.

This is not very efficient. For half of the elements, we return zero which has no effect during the reductin. Better solution is to run the reduction only for a half of the elements and to change the fetch function to

return u_view[ 2 * i ];

See the following example and compare the execution times.

1#include <iostream>
2#include <cstdlib>
3#include <TNL/Containers/Vector.h>
4#include <TNL/Algorithms/reduce.h>
5#include <TNL/Timer.h>
6
7using namespace TNL;
8using namespace TNL::Containers;
9using namespace TNL::Algorithms;
10
11template< typename Device >
12double mapReduce( Vector< double, Device >& u )
13{
14 auto u_view = u.getView();
15 auto fetch = [=] __cuda_callable__ ( int i )->double {
16 return u_view[ 2 * i ]; };
17 auto reduction = [] __cuda_callable__ ( const double& a, const double& b ) { return a + b; };
18 return reduce< Device >( 0, u_view.getSize() / 2, fetch, reduction, 0.0 );
19}
20
21int main( int argc, char* argv[] )
22{
23 Timer timer;
24 Vector< double, Devices::Host > host_u( 100000 );
25 host_u = 1.0;
26 timer.start();
27 double result = mapReduce( host_u );
28 timer.stop();
29 std::cout << "Host result is:" << result << ". It took " << timer.getRealTime() << " seconds." << std::endl;
30#ifdef HAVE_CUDA
31 Vector< double, Devices::Cuda > cuda_u( 100000 );
32 cuda_u = 1.0;
33 timer.reset();
34 timer.start();
35 result = mapReduce( cuda_u );
36 timer.stop();
37 std::cout << "CUDA result is:" << result << ". It took " << timer.getRealTime() << " seconds." << std::endl;
38#endif
39 return EXIT_SUCCESS;
40}
41
Host result is:50000. It took 0.0161199 seconds.
CUDA result is:50000. It took 0.000416337 seconds.

Reduction with argument

In some situations we may need to locate given element in the vector. For example index of the smallest or the largest element. reduceWithArgument is a function which can do it. In the following example, we modify function for computing the maximum norm of a vector. Instead of just computing the value, now we want to get index of the element having the absolute value equal to the max norm. The lambda function reduction do not compute only maximum of two given elements anymore, but it must also compute index of the winner. See the following code:

1#include <iostream>
2#include <cstdlib>
3#include <TNL/Containers/Vector.h>
4#include <TNL/Algorithms/reduce.h>
5
6using namespace TNL;
7using namespace TNL::Containers;
8using namespace TNL::Algorithms;
9
10template< typename Device >
12maximumNorm( const Vector< double, Device >& v )
13{
14 auto view = v.getConstView();
15
16 auto fetch = [=] __cuda_callable__ ( int i ) { return abs( view[ i ] ); };
17 auto reduction = [] __cuda_callable__ ( double& a, const double& b, int& aIdx, const int& bIdx ) {
18 if( a < b ) {
19 a = b;
20 aIdx = bIdx;
21 }
22 else if( a == b && bIdx < aIdx )
23 aIdx = bIdx;
24 };
25 return reduceWithArgument< Device >( 0, view.getSize(), fetch, reduction, std::numeric_limits< double >::max() );
26}
27
28int main( int argc, char* argv[] )
29{
31 host_v.forAllElements( [] __cuda_callable__ ( int i, double& value ) { value = i - 7; } );
32 std::cout << "host_v = " << host_v << std::endl;
33 auto maxNormHost = maximumNorm( host_v );
34 std::cout << "The maximum norm of the host vector elements is " << maxNormHost.first << " at position " << maxNormHost.second << "." << std::endl;
35#ifdef HAVE_CUDA
37 cuda_v.forAllElements( [] __cuda_callable__ ( int i, double& value ) { value = i - 7; } );
38 std::cout << "cuda_v = " << cuda_v << std::endl;
39 auto maxNormCuda = maximumNorm( cuda_v );
40 std::cout << "The maximum norm of the device vector elements is " << maxNormCuda.first << " at position " << maxNormCuda.second << "." << std::endl;
41#endif
42 return EXIT_SUCCESS;
43}
44

The definition of the lambda function reduction reads as:

auto reduction = [] __cuda_callable__ ( double& a, const double& b, int& aIdx, const int& bIdx );

In addition to vector elements values a and b, it gets also their positions aIdx and bIdx. The functions is responsible to set a to maximum of the two and aIdx to the position of the larger element. Note, that the parameters have the above mentioned meaning only in case of computing minimum or maximum.

The result looks as:

host_v = [ -7, -6, -5, -4, -3, -2, -1, 0, 1, 2 ]
The maximum norm of the host vector elements is 7 at position 0.
cuda_v = [ -7, -6, -5, -4, -3, -2, -1, 0, 1, 2 ]
The maximum norm of the device vector elements is 1.79769e+308 at position 10.

Using functionals for reduction

You might notice, that the lambda function reduction does not take so many different form compared to fetch. In addition, setting the zero (or idempotent) element can be annoying especially when computing minimum or maximum and we need to check std::limits function to make the code working with any type. To make things simpler, TNL offers variants of several functionals known from STL. They can be used instead of the lambda function reduction and they also carry the idempotent element. See the following example showing the scalar product of two vectors, now with functional:

1#include <iostream>
2#include <cstdlib>
3#include <TNL/Containers/Vector.h>
4#include <TNL/Algorithms/reduce.h>
5
6using namespace TNL;
7using namespace TNL::Containers;
8using namespace TNL::Algorithms;
9
10template< typename Device >
11double scalarProduct( const Vector< double, Device >& u, const Vector< double, Device >& v )
12{
13 auto u_view = u.getConstView();
14 auto v_view = v.getConstView();
15
16 /***
17 * Fetch computes product of corresponding elements of both vectors.
18 */
19 return reduce< Device >(
20 0, v_view.getSize(),
21 [=] __cuda_callable__ ( int i ) { return u_view[ i ] * v_view[ i ]; },
22 TNL::Plus{} );
23}
24
25int main( int argc, char* argv[] )
26{
27 /***
28 * The first test on CPU ...
29 */
30 Vector< double, Devices::Host > host_u( 10 ), host_v( 10 );
31 host_u = 1.0;
32 host_v.forAllElements( [] __cuda_callable__ ( int i, double& value ) { value = 2 * ( i % 2 ) - 1; } );
33 std::cout << "host_u = " << host_u << std::endl;
34 std::cout << "host_v = " << host_v << std::endl;
35 std::cout << "The scalar product ( host_u, host_v ) is " << scalarProduct( host_u, host_v ) << "." << std::endl;
36 std::cout << "The scalar product ( host_v, host_v ) is " << scalarProduct( host_v, host_v ) << "." << std::endl;
37
38 /***
39 * ... the second test on GPU.
40 */
41#ifdef HAVE_CUDA
42 Vector< double, Devices::Cuda > cuda_u( 10 ), cuda_v( 10 );
43 cuda_u = 1.0;
44 cuda_v.forAllElements( [] __cuda_callable__ ( int i, double& value ) { value = 2 * ( i % 2 ) - 1; } );
45 std::cout << "cuda_u = " << cuda_u << std::endl;
46 std::cout << "cuda_v = " << cuda_v << std::endl;
47 std::cout << "The scalar product ( cuda_u, cuda_v ) is " << scalarProduct( cuda_u, cuda_v ) << "." << std::endl;
48 std::cout << "The scalar product ( cuda_v, cuda_v ) is " << scalarProduct( cuda_v, cuda_v ) << "." << std::endl;
49#endif
50 return EXIT_SUCCESS;
51}
52
Function object implementing x + y.
Definition: Functional.h:24

This example also shows more compact how to evoke the function reduce (lines 19-22). This way, one should be able to perform (parallel) reduction very easily. The result looks as follows:

host_u = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
host_v = [ -1, 1, -1, 1, -1, 1, -1, 1, -1, 1 ]
The scalar product ( host_u, host_v ) is 0.
The scalar product ( host_v, host_v ) is 10.
cuda_u = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
cuda_v = [ -1, 1, -1, 1, -1, 1, -1, 1, -1, 1 ]
The scalar product ( cuda_u, cuda_v ) is 0.
The scalar product ( cuda_v, cuda_v ) is 10.

In TNL/Functionals.h you may find probably all operations that can be reasonably used for reduction:

Functional Reduction operation
TNL::Plus Sum
TNL::Multiplies Product
TNL::Min Minimum
TNL::Max Maximum
TNL::MinWithArg Minimum with argument
TNL::MaxWithArg Maximum with argument
TNL::LogicalAnd Logical AND
TNL::LogicalOr Logical OR
TNL::BitAnd Bit AND
TNL::BitOr Bit OR

Flexible scan

Inclusive and exclusive scan

Inclusive scan (or prefix sum) operation turns a sequence \(a_1, \ldots, a_n\) into a sequence \(s_1, \ldots, s_n\) defined as

\[ s_i = \sum_{j=1}^i a_i. \]

Exclusive scan (or prefix sum) is defined as

\[ \sigma_i = \sum_{j=1}^{i-1} a_i. \]

For example, inclusive prefix sum of

[1,3,5,7,9,11,13]

is

[1,4,9,16,25,36,49]

and exclusive prefix sum of the same sequence is

[0,1,4,9,16,25,36]

Both kinds of scan are usually applied only on summation, however product or logical operations could be handy as well. In TNL, scan is implemented in similar way as reduction and uses the same functors as the reduction operation. The following example shows how it works:

inplaceInclusiveScan( array, 0, array.getSize(), TNL::Plus{} );

This is equivalent to the following shortened call (the second, third and fourth parameters have a default value):

inplaceInclusiveScan( array );

The complete example looks as follows:

1#include <iostream>
2#include <TNL/Containers/Array.h>
3#include <TNL/Algorithms/scan.h>
4
5using namespace TNL;
6using namespace TNL::Containers;
7using namespace TNL::Algorithms;
8
9int main( int argc, char* argv[] )
10{
11 /***
12 * Firstly, test the prefix sum with an array allocated on CPU.
13 */
15 host_a = 1.0;
16 std::cout << "host_a = " << host_a << std::endl;
17 inplaceInclusiveScan( host_a );
18 std::cout << "The prefix sum of the host array is " << host_a << "." << std::endl;
19
20 /***
21 * And then also on GPU.
22 */
23#ifdef HAVE_CUDA
25 cuda_a = 1.0;
26 std::cout << "cuda_a = " << cuda_a << std::endl;
27 inplaceInclusiveScan( cuda_a );
28 std::cout << "The prefix sum of the CUDA array is " << cuda_a << "." << std::endl;
29#endif
30 return EXIT_SUCCESS;
31}
Array is responsible for memory management, access to array elements, and general array operations.
Definition: Array.h:73
void inplaceInclusiveScan(Array &array, typename Array::IndexType begin, typename Array::IndexType end, Reduction &&reduction, typename Array::ValueType identity)
Computes an inclusive scan (or prefix sum) of an array in-place.
Definition: scan.h:240

Scan does not use fetch function because the scan must be performed on an array. Its complexity is also higher compared to reduction. Thus if one needs to do some operation with the array elements before the scan, this can be done explicitly and it will not affect the performance significantly. On the other hand, the scan function takes interval of the vector elements where the scan is performed as its second and third argument. The next argument is the operation to be performed by the scan and the last parameter is the idempotent ("zero") element of the operation.

The result looks as:

host_a = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
The prefix sum of the host array is [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ].
cuda_a = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
The prefix sum of the CUDA array is [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ].

Exclusive scan works similarly. The complete example looks as follows:

1#include <iostream>
2#include <TNL/Containers/Array.h>
3#include <TNL/Algorithms/scan.h>
4
5using namespace TNL;
6using namespace TNL::Containers;
7using namespace TNL::Algorithms;
8
9int main( int argc, char* argv[] )
10{
11 /***
12 * Firstly, test the prefix sum with an array allocated on CPU.
13 */
15 host_a = 1.0;
16 std::cout << "host_a = " << host_a << std::endl;
17 inplaceExclusiveScan( host_a );
18 std::cout << "The prefix sum of the host array is " << host_a << "." << std::endl;
19
20 /***
21 * And then also on GPU.
22 */
23#ifdef HAVE_CUDA
25 cuda_a = 1.0;
26 std::cout << "cuda_a = " << cuda_a << std::endl;
27 inplaceExclusiveScan( cuda_a );
28 std::cout << "The prefix sum of the CUDA array is " << cuda_a << "." << std::endl;
29#endif
30 return EXIT_SUCCESS;
31}
void inplaceExclusiveScan(Array &array, typename Array::IndexType begin, typename Array::IndexType end, Reduction &&reduction, typename Array::ValueType identity)
Computes an exclusive scan (or prefix sum) of an array in-place.
Definition: scan.h:314

And the result looks as:

host_a = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
The prefix sum of the host array is [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 ].
cuda_a = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
The prefix sum of the CUDA array is [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 ].

Segmented scan

Segmented scan is a modification of common scan. In this case the sequence of numbers in hand is divided into segments like this, for example

[1,3,5][2,4,6,9][3,5],[3,6,9,12,15]

and we want to compute inclusive or exclusive scan of each segment. For inclusive segmented prefix sum we get

[1,4,9][2,6,12,21][3,8][3,9,18,30,45]

and for exclusive segmented prefix sum it is

[0,1,4][0,2,6,12][0,3][0,3,9,18,30]

In addition to common scan, we need to encode the segments of the input sequence. It is done by auxiliary flags array (it can be array of booleans) having 1 at the begining of each segment and 0 on all other positions. In our example, it would be like this:

[1,0,0,1,0,0,0,1,0,1,0,0, 0, 0]
[1,3,5,2,4,6,9,3,5,3,6,9,12,15]

Note: Segmented scan is not implemented for CUDA yet.

1#include <iostream>
2#include <TNL/Containers/Array.h>
3#include <TNL/Algorithms/SegmentedScan.h>
4
5using namespace TNL;
6using namespace TNL::Containers;
7using namespace TNL::Algorithms;
8
9template< typename Device >
10void segmentedScan( Array< double, Device >& v, Array< bool, Device >& flags )
11{
12 /***
13 * Reduction is sum of two numbers.
14 */
15 auto reduce = [] __cuda_callable__ ( const double& a, const double& b ) { return a + b; };
16
17 /***
18 * As parameters, we pass array on which the scan is to be performed, interval
19 * where the scan is performed, lambda function which is used by the scan and
20 * zero element (idempotent) of the 'sum' operation.
21 */
22 SegmentedScan< Device >::perform( v, flags, 0, v.getSize(), reduce, 0.0 );
23}
24
25int main( int argc, char* argv[] )
26{
27 /***
28 * Firstly, test the segmented prefix sum with arrays allocated on CPU.
29 */
30 Array< bool, Devices::Host > host_flags{ 1,0,0,1,0,0,0,1,0,1,0,0, 0, 0 };
31 Array< double, Devices::Host > host_v { 1,3,5,2,4,6,9,3,5,3,6,9,12,15 };
32 std::cout << "host_flags = " << host_flags << std::endl;
33 std::cout << "host_v = " << host_v << std::endl;
34 segmentedScan( host_v, host_flags );
35 std::cout << "The segmented prefix sum of the host array is " << host_v << "." << std::endl;
36
37 /***
38 * And then also on GPU.
39 */
40#ifdef HAVE_CUDA
41 //Array< bool, Devices::Cuda > cuda_flags{ 1,0,0,1,0,0,0,1,0,1,0,0, 0, 0 };
42 //Array< double, Devices::Cuda > cuda_v { 1,3,5,2,4,6,9,3,5,3,6,9,12,15 };
43 //std::cout << "cuda_flags = " << cuda_flags << std::endl;
44 //std::cout << "cuda_v = " << cuda_v << std::endl;
45 //segmentedScan( cuda_v, cuda_flags );
46 //std::cout << "The segmnted prefix sum of the CUDA array is " << cuda_v << "." << std::endl;
47#endif
48 return EXIT_SUCCESS;
49}
50
__cuda_callable__ IndexType getSize() const
Returns the current array size.
Definition: Array.hpp:346
Result reduce(const Index begin, const Index end, Fetch &&fetch, Reduction &&reduction, const Result &identity)
reduce implements (parallel) reduction for vectors and arrays.
Definition: reduce.h:78
Computes segmented scan (or prefix sum) on a vector.
Definition: SegmentedScan.h:67

The result reads as:

host_flags = [ 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0 ]
host_v = [ 1, 3, 5, 2, 4, 6, 9, 3, 5, 3, 6, 9, 12, 15 ]
The segmented prefix sum of the host array is [ 1, 4, 9, 2, 6, 12, 21, 3, 8, 3, 9, 18, 30, 45 ].