Template Numerical Library version develop:6514e4815
General concepts

Introduction

In this part we describe some general and core concepts of programming with TNL. Understanding these ideas may significantly help to understand the design of TNL algorithms and data structure and it also helps to use TNL more efficiently. The main goal of TNL is to allow developing high performance algorithms that could run on multicore CPUs and GPUs. TNL offers unified interface and so the developer writes one code for both architectures.

Devices and allocators

TNL offers unified interface for both CPUs (also referred as a host system) and GPUs (referred as device). Connection between CPU and GPU is usually represented by PCI-Express bus which is orders of magnitude slower compared to speed of the global memory of GPU. Therefore, the communication between CPU and GPU must be reduced as much as possible. As a result, the programmer operates with two different address spaces, one for CPU and one for GPU. To distinguish between the address spaces, each data structure requiring dynamic allocation of memory needs to now on what device it resides. This is done by a template parameter Device. For example the following code creates two arrays, one on CPU and the other on GPU

Array is responsible for memory management, access to array elements, and general array operations.
Definition: Array.h:73

Since now, C++ template sepcialization takes care of using the right methods for given device (in meaning hardware architecture and so the device can be even CPU). For example, calling a method setSize

1host_array.setSize( 10 );
2cuda_array.setSize( 10 );
void setSize(IndexType size)
Method for setting the array size.
Definition: Array.hpp:326

results in different memory allocation on CPU (for host_array) and on GPU (for cuda_array). The same holds for assignment

1cuda_array = host_aray;

in which case appropriate data transfer from CPU to GPU is performed. Each such data structure contains inner type named DeviceType which tells where it resides as we can see here:

1template< typename Array >
2void deduceDevice
3{
4 using Device = typename Array::DeviceType;
5 TNL::Container::Array< int, Device > array;
6}

If we need to specialize some parts of algorithm with respect to its device we can do it by means of std::is_same :

TODO: Allocators

Algorithms and lambda functions

Developing a code for GPUs (in CUDA for example) consists mainly of writing kernels which are special functions running on GPU in parallel. This can be very hard and tedious work especially when it comes to debugging. Parallel reduction is a perfect example of an algorithm which is relatively hard to understand and implement on one hand but it is necessary to use frequently. Writing tens of lines of code every time we need to sum up some data is exactly what we mean by tedious programming. TNL offers skeletons or patterns of such algorithms and combines them with user defined lambda functions. This approach is not absolutely general, which means that you can use it only in situation when there is a skeleton/pattern (see TNL::Algorithms) suitable for your problem. But when there is, it offers several advantages:

  1. Implementing lambda functions is much easier compared to implementing GPU kernels.
  2. Code implemented this way works even on CPU, so the developer writes only one code for both hardware architectures.
  3. The developer may debug the code on CPU first and then just run it on GPU. Quite likely it will work with only a little or no changes.

The following code snippet demonstrates it on use of TNL::Algorithms::ParallelFor:

1template< typename Device >
2void vectorAddition( double* v1, double* v2, double* sum, const int size )
3{
4 auto sum_lambda = [=] __cuda_callable__ ( int i ) mutable {
5 sum[ i ] = v1[ i ] + v2[ i ];
6 }
7 TNL::Algorithms::ParalellFor< Device >::exec( 0, size, sum_lambda );
8}

In this example, we assume that all arrays v1, v2 and sum were properly allocated on given Device. If Device equals TNL::Devices::Host , the lambda function is processed sequentially or in parallel by several OpenMP threads on CPU. If Device equals TNL::Devices::Cuda , the lambda function is called from CUDA kernel (this is why it is defined as __cuda_callable__ which is just a substitute for __host__ __device__ ) by apropriate number of CUDA threads. One more example demonstrates use of TNL::Algorithms::Reduction .

1template< typename Device >
2void scalarProduct( double* v1, double* v2, double* product, const int size )
3{
4 auto fetch = [=] __cuda_callable__ ( int i ) -> double {
5 return = v1[ i ] * v2[ i ];
6 }
7 auto reduce = [] __cuda_callable__ ( const double& a, const double& b ) {
8 return a + b; };
9 TNL::Algorithms::reduce< Device >( 0, size, fetch, reduce, 0.0 );
10}
Result reduce(const Index begin, const Index end, Fetch &&fetch, Reduction &&reduction, const Result &identity)
reduce implements (parallel) reduction for vectors and arrays.
Definition: reduce.h:78

We will not explain the parallel reduction in TNL at this moment (see the section about flexible parallel reduction ), we hope that the idea is more or less clear from the code snippet. If Device equals to TNL::Device::Host , the scalar product is evaluated sequentially or in parallel by several OpenMP threads on CPU, if Device equals TNL::Algorithms::Cuda, the parallel reduction fine tuned with the lambda functions is performed. Fortunately, there is no performance drop. On the contrary, since it is easy to generate CUDA kernels for particular situations, we may get more efficient code. Consider computing a scalar product of sum of vectors like this

\[ s = (u_1 + u_2, v_1 + v_2). \]

This can be solved by the following code

1template< typename Device >
2void scalarProduct( double* u1, double* u2,
3 double* v1, double* v2,
4 double* product, const int size )
5{
6 auto fetch = [=] __cuda_callable__ ( int i ) -> double {
7 return = ( u1[ i ] + u2[ i ] ) * ( v1[ i ] + v2[ i ] );
8 }
9 auto reduce = [] __cuda_callable__ ( const double& a, const double& b ) {
10 return a + b; };
11 TNL::Algorithms::reduce< Device >( 0, size, fetch, reduce, 0.0 );
12}

We have changed only the fetch lambda function to perform the sums of u1[ i ] + u2[ i ] and v1[ i ] + v2[ i ] (line 7). Now we get completely new CUDA kernel tailored exactly for our problem. Doing the same with Cublas, for example, would require splitting into three separate kernels:

  1. Kernel to compute \(u_1 = u_1 + u_2\).
  2. Kernel to compute \(v_1 = v_1 + v_2\).
  3. Kernel to compute \(product = ( u_1, v_1 )\).

This could be achieved with the following code:

1void scalarProduct( double* u1, double* u2,
2 double* v1, double* v2,
3 double* product, const int size )
4{
5 cublasHandle_t handle;
6 cublasSaxpy( handle, size, 1.0, u1, 1, u2, 1 );
7 cublasSaxpy( handle, size, 1.0, v1, 1, v2, 1 );
8 cublasSdot ( handle, size, 1.0, u1, 1, v1, 1, &product );
9}

We believe that C++ lambda functions with properly designed patterns of parallel algorithms could make programming of GPUs significantly easier. We see a parallel with MPI standard which in nineties defined frequent communication operations in distributed parallel computing. It made programming of distributed systems much easier and at the same time MPI helps to write efficient programs. We aim to add additional skeletons or patterns to TNL::Algorithms.

Shared pointers and views

You might notice that in the previous section we used only C style arrays represented by pointers in the lambda functions. There is a difficulty when we want to access TNL arrays or other data structures inside the lambda functions. We may capture the outside variables either by a value or a reference. The first case would be as follows:

1template< typename Device >
2void lambda_capture_by_value( int size )
3{
5 auto f = [=] __cuda_callable__ ( int i ) mutable {
6 a[ i ] = 1;
7 };
9}
10
11
static void exec(Index start, Index end, Function f, FunctionArgs... args)
Static method for the execution of the loop.
Definition: ParallelFor.h:90

In this case a deep copy of array a will be made and so there will be no effect of what we do with the array a in the lambda function. Capturing by a reference may look as follows:

1template< typename Device >
2void lambda_capture_by_value( int size )
3{
5 auto f = [&a] __cuda_callable__ ( int i ) mutable {
6 a[ i ] = 1;
7 };
9}
10
11

This would be correct on CPU (i.e. when Device is TNL::Devices::Host ). However, we are not allowed to pass references to CUDA kernels and so this source code would not even compile with CUDA compiler. To overcome this issue, TNL offers two solutions:

  1. Data structures views
  2. Shared pointers

Data structures views

View is a kind of lightweight reference object which makes only a shallow copy of itself in copy constructor. Therefore view can by captured by value, but because it is, in fact, a reference to another object, everything we do with the view will affect the original object. The example with the array would look as follows:

1template< typename Device >
2void lambda_capture_by_value( int size )
3{
5 auto view = a.getView();
6 auto f = [=] __cuda_callable__ ( int i ) mutable {
7 view[ i ] = 1;
8 };
10}

The differences are on the line 5 where we fetch the view by means of method getView and on the line 7 where we work with the view and not with the array a. The view has very similar interface (see TNL::Containers::ArrayView) as the array (TNL::Containers::Array) and so mostly there is no difference in using array and its view for the programmer. In TNL, each data structure which can be accessed from GPU kernels (it means that it has methods defined as __cuda_callable__) provides also a method getView for getting appropriate view of the object.

Views are simple objects because they must be transferred to GPU in each kernel call. So there are no smart links between a view and the original object. In fact, the array view contains just a pointer the data managed by the array and the size of the array. Therefore if the original object get changed, all views obtained from the object before may become invalid. See the following example:

1template< typename Device >
2void lambda_capture_by_value( int size )
3{
5 auto view = a.getView();
6 a.setSize( 2 * size );
7 auto f = [=] __cuda_callable__ ( int i ) mutable {
8 view[ i ] = 1;
9 };
11}

Such code would not work because after obtaining the view on the line 5, we change the size of the array a which will cause data reallocation. As we mentioned, there is no pointer from the view to the array and so the view has no chance to check if it is still up-to-date with the original object. However, if you fetch all necessary views immediately before capturing by a lambda function, there will be no problem. And this is why the views are recommended for accessing TNL data structures in lambda functions or GPU kernels.

Note, that changing the data managed by the array after fetching the view is not an issue. See the following example:

1template< typename Device >
2void lambda_capture_by_value( int size )
3{
5 auto view = a.getView();
6 a.setElement( 0, 1 );
7 auto f = [=] __cuda_callable__ ( int i ) mutable {
8 view[ i ] = 1;
9 };
11}

On the line 6, we change value of the first element. This causes no data reallocation or change of size and so the view fetched on the line 5 is still valid and up-to-date.

Shared pointers

TNL offers smart pointers working across different devices (meaning CPU or GPU).