Template Numerical Library version develop:6514e4815
For loops

## Introduction

This tutorial shows how to use the different kinds of for-loops implemented in TNL:

• Parallel for is a for-loop which can be run in parallel, i.e. all iterations of the loop must be independent. Parallel for can be run on both multicore CPUs and GPUs.
• n-dimensional parallel for is an extension of the common parallel for to higher dimensions.
• Unrolled for is a for-loop which is performed sequentially and is explicitly unrolled by C++ templates. Iteration bounds must be static (known at compile time).
• Static for is a for-loop with static bounds (known at compile time) and indices usable in constant expressions.

## Parallel For

The basic parallel for construct in TNL expresses parallel for-loops independently of the hardware platform, which is selected by a template parameter. The loop is implemented as TNL::Algorithms::ParallelFor and can be used as:

ParallelFor< Device >::exec( start, end, function, arguments... );

The Device can be either TNL::Devices::Host or TNL::Devices::Cuda. The first two parameters define the loop bounds in the C style. It means that there will be iterations for indices start, start+1, ..., end-1. The function is a lambda function to be called in each iteration. It is supposed to receive the iteration index and arguments passed to the parallel for (the last arguments).
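The calling convention can be illustrated without TNL. The following is only a minimal sequential sketch: sequentialFor and addWithOffset are hypothetical names introduced here, not TNL functions, and the code runs on the host only. It shows how an additional argument passed after the lambda arrives as the parameter that follows the iteration index.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper, not part of TNL: a sequential stand-in that follows the
// calling convention described above, i.e. f( i, args... ) for i = start, ..., end-1.
template< typename Index, typename Function, typename... Args >
void sequentialFor( Index start, Index end, Function&& f, Args&&... args )
{
   for( Index i = start; i < end; i++ )
      f( i, args... );
}

// Computes result[ i ] = a[ i ] + b[ i ] + c. The constant c is not captured by
// the lambda; it is passed as an additional argument after the lambda.
std::vector< double > addWithOffset( const std::vector< double >& a,
                                     const std::vector< double >& b,
                                     double c )
{
   assert( a.size() == b.size() );
   std::vector< double > result( a.size() );
   auto add = [&]( int i, double offset ) {
      result[ i ] = a[ i ] + b[ i ] + offset;
   };
   sequentialFor( 0, static_cast< int >( a.size() ), add, c );
   return result;
}
```

Calling addWithOffset( { 1, 1, 1 }, { 0, 1, 2 }, 2 ) runs the lambda for i = 0, 1, 2 with offset equal to 2. Capturing c in the lambda, as the TNL example in this section does, is an equally valid way to pass it.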

See the following example:

#include <cstdlib>
#include <iostream>
#include <TNL/Containers/Vector.h>
#include <TNL/Algorithms/ParallelFor.h>
using namespace TNL;
using namespace TNL::Containers;
using namespace TNL::Algorithms;
/****
 * Set all elements of the vector v to the constant c.
 */
template< typename Device >
void initVector( Vector< double, Device >& v,
                 const double& c )
{
   auto view = v.getView();
   auto init = [=] __cuda_callable__ ( int i ) mutable
   {
      view[ i ] = c;
   };
   ParallelFor< Device >::exec( 0, v.getSize(), init );
}
int main( int argc, char* argv[] )
{
   /***
    * First, test the vector initialization on the CPU.
    */
   Vector< double, Devices::Host > host_v( 10 );
   initVector( host_v, 1.0 );
   std::cout << "host_v = " << host_v << std::endl;
   /***
    * And then also on the GPU.
    */
#ifdef HAVE_CUDA
   Vector< double, Devices::Cuda > cuda_v( 10 );
   initVector( cuda_v, 1.0 );
   std::cout << "cuda_v = " << cuda_v << std::endl;
#endif
   return EXIT_SUCCESS;
}

The result is (the second line is printed only when the example is built with CUDA support):

host_v = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
cuda_v = [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]

## n-dimensional Parallel For

For-loops in higher dimensions can be performed similarly via TNL::Algorithms::ParallelFor2D and TNL::Algorithms::ParallelFor3D. In the following example we build a 2D mesh function on top of TNL::Containers::Vector. Two-dimensional indices ( i, j ) are mapped to the vector index idx as idx = j * xSize + i, where the mesh function has dimensions xSize * ySize. The following simple example initializes the mesh function with a constant value c = 1.0:

#include <cstdlib>
#include <iostream>
#include <TNL/Containers/Vector.h>
#include <TNL/Algorithms/ParallelFor.h>
using namespace TNL;
using namespace TNL::Containers;
using namespace TNL::Algorithms;
template< typename Device >
void initMeshFunction( const int xSize,
                       const int ySize,
                       Vector< double, Device >& v,
                       const double& c )
{
   auto view = v.getView();
   auto init = [=] __cuda_callable__ ( int i, int j ) mutable
   {
      view[ j * xSize + i ] = c;
   };
   ParallelFor2D< Device >::exec( 0, 0, xSize, ySize, init );
}
int main( int argc, char* argv[] )
{
   /***
    * Define dimensions of the 2D mesh function.
    */
   const int xSize( 10 ), ySize( 10 );
   const int size = xSize * ySize;
   /***
    * First, test the mesh function initialization on the CPU.
    * The vector must be allocated with xSize * ySize elements.
    */
   Vector< double, Devices::Host > host_v( size );
   initMeshFunction( xSize, ySize, host_v, 1.0 );
   /***
    * And then also on the GPU.
    */
#ifdef HAVE_CUDA
   Vector< double, Devices::Cuda > cuda_v( size );
   initMeshFunction( xSize, ySize, cuda_v, 1.0 );
#endif
   return EXIT_SUCCESS;
}

Notice the parameters of the lambda function init. The first parameter i changes faster than j, and therefore the index mapping has the form j * xSize + i, which accesses the vector elements sequentially on the CPU and gives coalesced memory accesses on the GPU. The for-loop is executed by calling ParallelFor2D with the proper device. The first four parameters are startX, startY, endX, endY, and on the CPU this is equivalent to the following nested for-loops:

for( Index j = startY; j < endY; j++ )
   for( Index i = startX; i < endX; i++ )
      f( i, j, args... );

where args... stands for additional arguments passed to the for-loop. After the parameters defining the loop bounds, the lambda function (init in this case) is passed, followed by additional arguments that are forwarded to the lambda function after the iteration indices. In the example above there are no additional arguments, since the lambda function init captures all variables it needs to work with.
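To make the traversal order concrete, here is a minimal host-only sketch of the nested loops above. The names sequentialFor2D and visitedIndices are hypothetical and not part of TNL; the sketch only assumes the argument order described in this section. The row length is passed as an additional argument to show how it is forwarded after the indices i and j.

```cpp
#include <cassert>
#include <vector>

// Hypothetical host-only helper, not part of TNL: nested loops taking the
// arguments in the order startX, startY, endX, endY, f, args...
template< typename Index, typename Function, typename... Args >
void sequentialFor2D( Index startX, Index startY, Index endX, Index endY, Function&& f, Args&&... args )
{
   for( Index j = startY; j < endY; j++ )
      for( Index i = startX; i < endX; i++ )
         f( i, j, args... );
}

// Returns the linear indices j * xSize + i in the order in which they are visited.
std::vector< int > visitedIndices( int xSize, int ySize )
{
   assert( xSize >= 0 && ySize >= 0 );
   std::vector< int > order;
   // rowLength is an additional argument forwarded after the indices i and j
   auto record = [&]( int i, int j, int rowLength ) {
      order.push_back( j * rowLength + i );
   };
   sequentialFor2D( 0, 0, xSize, ySize, record, xSize );
   return order;
}
```

Because i is the inner loop index, the recorded linear indices are consecutive, which is the sequential access pattern mentioned above.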

For completeness, here is the mesh function example modified for 3D:

#include <cstdlib>
#include <iostream>
#include <TNL/Containers/Vector.h>
#include <TNL/Algorithms/ParallelFor.h>
using namespace TNL;
using namespace TNL::Containers;
using namespace TNL::Algorithms;
template< typename Device >
void initMeshFunction( const int xSize,
                       const int ySize,
                       const int zSize,
                       Vector< double, Device >& v,
                       const double& c )
{
   auto view = v.getView();
   auto init = [=] __cuda_callable__ ( int i, int j, int k ) mutable
   {
      view[ ( k * ySize + j ) * xSize + i ] = c;
   };
   ParallelFor3D< Device >::exec( 0, 0, 0, xSize, ySize, zSize, init );
}
int main( int argc, char* argv[] )
{
   /***
    * Define dimensions of the 3D mesh function.
    */
   const int xSize( 10 ), ySize( 10 ), zSize( 10 );
   const int size = xSize * ySize * zSize;
   /***
    * First, test the mesh function initialization on the CPU.
    * The vector must be allocated with xSize * ySize * zSize elements.
    */
   Vector< double, Devices::Host > host_v( size );
   initMeshFunction( xSize, ySize, zSize, host_v, 1.0 );
   /***
    * And then also on the GPU.
    */
#ifdef HAVE_CUDA
   Vector< double, Devices::Cuda > cuda_v( size );
   initMeshFunction( xSize, ySize, zSize, cuda_v, 1.0 );
#endif
   return EXIT_SUCCESS;
}
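
The 3D index mapping used in the lambda can be checked in isolation. The following sketch uses plain C++ only; linearIndex and gridIndices are hypothetical helper names, not TNL functions. It pairs the mapping ( k * ySize + j ) * xSize + i with its inverse, which may help when debugging index arithmetic.

```cpp
#include <array>
#include <cassert>

// Linear index of the element ( i, j, k ); i is the fastest-changing index.
int linearIndex( int i, int j, int k, int xSize, int ySize )
{
   return ( k * ySize + j ) * xSize + i;
}

// Inverse mapping (hypothetical helper): recovers ( i, j, k ) from idx.
std::array< int, 3 > gridIndices( int idx, int xSize, int ySize )
{
   assert( idx >= 0 && xSize > 0 && ySize > 0 );
   const int i = idx % xSize;
   const int j = ( idx / xSize ) % ySize;
   const int k = idx / ( xSize * ySize );
   return { i, j, k };
}
```

Note that zSize does not appear in the mapping itself; it only bounds the range of k.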

## Unrolled For

TNL::Algorithms::unrolledFor is a for-loop that is explicitly unrolled via C++ templates when the loop is short (up to eight iterations by default). The bounds of unrolledFor loops must be constant (i.e. known at compile time). It is often used with static arrays and vectors.

See the following example:

#include <iostream>
#include <TNL/Containers/StaticVector.h>
#include <TNL/Algorithms/unrolledFor.h>
using namespace TNL;
using namespace TNL::Containers;
int main( int argc, char* argv[] )
{
   /****
    * Create two static vectors
    */
   const int Size( 3 );
   StaticVector< Size, double > a, b;
   a = 1.0;
   b = 2.0;
   double sum( 0.0 );
   /****
    * Compute an addition of a vector and a constant number.
    */
   Algorithms::unrolledFor< int, 0, Size >(
      [&]( int i ) {
         a[ i ] = b[ i ] + 3.14;
         sum += a[ i ];
      }
   );
   std::cout << "a = " << a << std::endl;
   std::cout << "sum = " << sum << std::endl;
}

Notice that the unrolled for-loop works with a lambda function similarly to the parallel for-loop. The bounds of the loop are passed as template parameters in the statement Algorithms::unrolledFor< int, 0, Size >. The parameter of the unrolledFor function is the functor to be called in each iteration. The functor receives only the loop index i, as in the example above.

The result looks as follows:

a = [ 5.14, 5.14, 5.14 ]
sum = 15.42

The effect of unrolledFor is the same as that of an ordinary for-loop. The following code does the same as the previous example:

for( int i = 0; i < Size; i++ )
{
   a[ i ] = b[ i ] + 3.14;
   sum += a[ i ];
}

The benefit of unrolledFor is mainly in the explicit unrolling of short loops which can improve performance in some situations. The maximum length of loops that will be fully unrolled can be specified using the fourth template parameter as follows:

Algorithms::unrolledFor< int, 0, Size, 16 >( ... );

unrolledFor can be used also in CUDA kernels.
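
For readers curious how unrolling via templates can work in principle, here is a minimal C++17 sketch based on std::index_sequence and a fold expression. This is an illustration only and not TNL's implementation: unrolledLoop, unrollImpl and sumOfShifted are hypothetical names, and unlike unrolledFor the sketch has no limit on the unrolled length.

```cpp
#include <array>
#include <cstddef>
#include <utility>

// Expands to f( Begin + 0 ), f( Begin + 1 ), ... joined by the comma operator,
// so no run-time loop counter remains.
template< typename Index, Index Begin, typename Function, std::size_t... Offsets >
constexpr void unrollImpl( Function&& f, std::index_sequence< Offsets... > )
{
   ( f( Begin + static_cast< Index >( Offsets ) ), ... );
}

// Hypothetical helper, not TNL's implementation:
// calls f( i ) for i = Begin, ..., End-1 with bounds known at compile time.
template< typename Index, Index Begin, Index End, typename Function >
constexpr void unrolledLoop( Function&& f )
{
   static_assert( Begin <= End, "the loop bounds must satisfy Begin <= End" );
   unrollImpl< Index, Begin >( f, std::make_index_sequence< static_cast< std::size_t >( End - Begin ) >{} );
}

// Sums b[ i ] + shift over the three elements of b.
double sumOfShifted( const std::array< double, 3 >& b, double shift )
{
   double sum = 0.0;
   unrolledLoop< int, 0, 3 >( [&]( int i ) { sum += b[ i ] + shift; } );
   return sum;
}
```

As in the unrolledFor example, the functor receives an ordinary int, so the index cannot be used as a template argument inside the lambda.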

## Static For

TNL::Algorithms::staticFor is a generic for-loop whose iteration indices are usable in constant expressions (e.g. template arguments). It can be used as

staticFor< int, 0, N >( f );

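which results in the following sequence of function calls:

f( std::integral_constant< int, 0 >{} );
f( std::integral_constant< int, 1 >{} );
...
f( std::integral_constant< int, N-1 >{} );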

Notice that each iteration index is represented by its own distinct type using std::integral_constant. Hence, the functor f must be generic, e.g. a generic lambda expression such as in the following example:

#include <iostream>
#include <array>
#include <tuple>
#include <TNL/Algorithms/staticFor.h>
/*
 * Example function printing members of std::tuple using staticFor
 * and a lambda with capture.
 */
template< typename... Ts >
void printTuple( const std::tuple<Ts...>& tupleVar )
{
   std::cout << "{ ";
   TNL::Algorithms::staticFor<size_t, 0, sizeof... (Ts)>( [&](auto i) {
      // print the i-th member; i is usable as a template argument
      std::cout << std::get< i >( tupleVar );
      if( i < sizeof... (Ts) - 1 )
         std::cout << ", ";
   });
   std::cout << " }" << std::endl;
}
struct TuplePrinter
{
   constexpr TuplePrinter() = default;
   template< typename Index, typename... Ts >
   void operator()( Index i, const std::tuple<Ts...>& tupleVar )
   {
      // print the i-th member, followed by a separator unless it is the last one
      std::cout << std::get< i >( tupleVar );
      if( i < sizeof... (Ts) - 1 )
         std::cout << ", ";
   }
};
/*
 * Example function printing members of std::tuple using staticFor
 * and a structure with templated operator().
 */
template< typename... Ts >
void printTupleCallableStruct( const std::tuple<Ts...>& tupleVar )
{
   std::cout << "{ ";
   TNL::Algorithms::staticFor< size_t, 0, sizeof... (Ts) >( TuplePrinter(), tupleVar );
   std::cout << " }" << std::endl;
}
int main( int argc, char* argv[] )
{
   // initialize std::array
   std::array< int, 5 > a{ 1, 2, 3, 4, 5 };
   // print out the array using template parameters for indexing
   TNL::Algorithms::staticFor< int, 0, 5 >(
      [&a] ( auto i ) {
         std::cout << "a[ " << i << " ] = " << std::get< i >( a ) << std::endl;
      }
   );
   // example of printing a tuple using staticFor and a lambda function
   printTuple( std::make_tuple( "Hello", 3, 2.1 ) );
   // example of printing a tuple using staticFor and a structure with templated operator()
   printTupleCallableStruct( std::make_tuple( "Hello", 3, 2.1 ) );
}

The output looks as follows:

a[ 0 ] = 1
a[ 1 ] = 2
a[ 2 ] = 3
a[ 3 ] = 4
a[ 4 ] = 5
{ Hello, 3, 2.1 }
{ Hello, 3, 2.1 }
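
The mechanism behind indices that are usable in constant expressions can be sketched with the standard library alone. The code below is a simplified C++17 illustration, not TNL's implementation; staticLoop, staticLoopImpl and tupleSum are hypothetical names. Each iteration passes a value of a distinct std::integral_constant type, so the generic lambda can use it as a template argument of std::get.

```cpp
#include <cstddef>
#include <tuple>
#include <type_traits>
#include <utility>

// Calls f with std::integral_constant< Index, Begin + 0 >{}, < Begin + 1 >{}, ...
template< typename Index, Index Begin, typename Function, std::size_t... Offsets >
constexpr void staticLoopImpl( Function&& f, std::index_sequence< Offsets... > )
{
   ( f( std::integral_constant< Index, Begin + static_cast< Index >( Offsets ) >{} ), ... );
}

// Hypothetical helper, not TNL's implementation: every iteration index is a
// distinct type, so f must be generic (e.g. a lambda with an auto parameter).
template< typename Index, Index Begin, Index End, typename Function >
constexpr void staticLoop( Function&& f )
{
   static_assert( Begin <= End, "the loop bounds must satisfy Begin <= End" );
   staticLoopImpl< Index, Begin >( f, std::make_index_sequence< static_cast< std::size_t >( End - Begin ) >{} );
}

// Sums the members of a tuple of arithmetic types.
template< typename... Ts >
double tupleSum( const std::tuple< Ts... >& t )
{
   double sum = 0.0;
   staticLoop< std::size_t, 0, sizeof...( Ts ) >( [&]( auto i ) {
      sum += std::get< i >( t );  // i converts to a compile-time constant
   } );
   return sum;
}
```

For instance, tupleSum( std::make_tuple( 1, 2.5, 3 ) ) visits the members with indices 0, 1 and 2, each through a separately instantiated call of the lambda.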