Template Numerical Library version develop:6514e4815
Arrays tutorial

Introduction

This tutorial introduces arrays in TNL. There are three types - common arrays with dynamic allocation, static arrays allocated on stack and distributed arrays with dynamic allocation. Arrays are one of the most important structures for memory management. Methods implemented in arrays are particularly useful for GPU programming. From this point of view, the reader will learn how to easily allocate memory on GPU, transfer data between GPU and CPU but also, how to initialize data allocated on GPU. In addition, the resulting code is hardware platform independent, so it can be ran on CPU nad GPU without any changes.

Arrays

Array is templated class defined in namespace TNL::Containers having three template parameters:

  • Value is type of data to be stored in the array
  • Device is the device where the array is allocated. Currently it can be either Devices::Host for CPU or Devices::Cuda for GPU supporting CUDA.
  • Index is the type to be used for indexing the array elements.

The following example shows how to allocate arrays on CPU and GPU and how to initialize the data.

#include <iostream>
#include <TNL/Containers/Array.h>
#include <list>
#include <vector>
using namespace TNL;
using namespace TNL::Containers;
int main( int argc, char* argv[] )
{
/****
* Create one array on host and one array on device.
*/
Array< int > host_array( 10 );
Array< int, Devices::Cuda > device_array;
/***
* Initiate the host array with number three and assign it to the device array.
* NOTE: Of course, you may do directly 'device_array = 3' as well.
*/
host_array = 3;
device_array = host_array;
/****
* Print both arrays.
*/
std::cout << "host_array = " << host_array << std::endl;
std::cout << "device_array = " << device_array << std::endl;
/****
* There are few other ways how to initialize arrays...
*/
std::list< int > list { 1, 2, 3, 4, 5 };
std::vector< int > vector { 6, 7, 8, 9, 10 };
Array< int, Devices::Cuda > device_array_list( list );
Array< int, Devices::Cuda > device_array_vector( vector );
Array< int, Devices::Cuda > device_array_init_list { 11, 12, 13, 14, 15 };
/****
* ... and print them all
*/
std::cout << "device_array_list = " << device_array_list << std::endl;
std::cout << "device_array_vector = " << device_array_vector << std::endl;
std::cout << "device_array_init_list = " << device_array_init_list << std::endl;
}
T endl(T... args)
Namespace for TNL containers.
Definition: Array.h:25
The main TNL namespace.
Definition: AtomicOperations.h:22

The result looks as follows:

host_array = [ 3, 3, 3, 3, 3, 3, 3, 3, 3, 3 ]
device_array = [ 3, 3, 3, 3, 3, 3, 3, 3, 3, 3 ]
device_array_list = [ 1, 2, 3, 4, 5 ]
device_array_vector = [ 6, 7, 8, 9, 10 ]
device_array_init_list = [ 11, 12, 13, 14, 15 ]

Array views

Arrays cannot share data with each other or data allocated elsewhere. This can be achieved with the ArrayView structure which has similar semantics to Array, but it does not handle allocation and deallocation of the data. Hence, array view cannot be resized, but it can be used to wrap data allocated elsewhere (e.g. using an Array or an operator new) and to partition large arrays into subarrays. The process of wrapping external data with a view is called binding.

The following code snippet shows how to create an array view.

#include <iostream>
#include <TNL/Containers/Array.h>
#include <TNL/Containers/ArrayView.h>
using namespace TNL;
using namespace TNL::Containers;
int main( int argc, char* argv[] )
{
/****
* Create new array
*/
const int size = 5;
Array< float > a( size );
/****
* Bind an array view with it
*/
ArrayView< float > a_view = a.getView();
auto another_view = a.getView();
auto const_view = a.getConstView();
another_view = -5;
std::cout << " a = " << a << std::endl;
std::cout << " a_view = " << a_view << std::endl;
std::cout << " another_view = " << another_view << std::endl;
std::cout << " const_view = " << const_view << std::endl;
//const_view = 3; this would not compile
}

The output is:

a = [ -5, -5, -5, -5, -5 ]
a_view = [ -5, -5, -5, -5, -5 ]
another_view = [ -5, -5, -5, -5, -5 ]
const_view = [ -5, -5, -5, -5, -5 ]

You can also bind external data into array view:

#include <iostream>
#include <TNL/Containers/Array.h>
#include <TNL/Containers/ArrayView.h>
using namespace TNL;
using namespace TNL::Containers;
int main( int argc, char* argv[] )
{
/****
* Allocate your own data
*/
const int size = 5;
float* a = new float[ size ];
/****
* Wrap the data with an array view
*/
ArrayView< float > a_view( a, size );
a_view = -5;
std::cout << " a_view = " << a_view << std::endl;
for( int i = 0; i < size; i++ )
std::cout << a[ i ] << " ";
/****
* Free the allocated memory
*/
delete[] a;
}

Output:

a_view = [ -5, -5, -5, -5, -5 ]
-5 -5 -5 -5 -5

Since array views do not allocate or deallocate memory, they can be created even in CUDA kernels, which is not possible with Array. ArrayView can also be passed-by-value into CUDA kernels or captured-by-value by device lambda functions, because the ArrayView's copy-constructor makes only a shallow copy (i.e., it copies only the data pointer and size).

Accessing the array elements

There are two ways how to work with the array (or array view) elements - using the indexing operator (operator[]) which is more efficient or using methods setElement and getElement which is more flexible.

Accessing the array elements with operator[]

Indexing operator operator[] is implemented in both Array and ArrayView and it is defined as __cuda_callable__. It means that it can be called even in CUDA kernels if the data is allocated on GPU, i.e. the Device parameter is Devices::Cuda. This operator returns a reference to given array element and so it is very efficient. However, calling this operator from host for data allocated on device (or vice versa) leads to segmentation fault (on the host system) or broken state of the device. It means:

  • You may call the operator[] on the host only for data allocated on the host (with device Devices::Host).
  • You may call the operator[] on the device only for data allocated on the device (with device Devices::Cuda).

The following example shows use of operator[].

#include <iostream>
#include <TNL/Containers/Array.h>
#include <TNL/Containers/ArrayView.h>
using namespace TNL;
using namespace TNL::Containers;
__global__ void initKernel( ArrayView< float, Devices::Cuda > view )
{
int tid = threadIdx.x;
if( tid < view.getSize() )
view[ tid ] = -tid;
}
int main( int argc, char* argv[] )
{
/****
* Create new arrays on both host and device
*/
const int size = 5;
Array< float, Devices::Host > host_array( size );
Array< float, Devices::Cuda > device_array( size );
/****
* Initiate the host array
*/
for( int i = 0; i < size; i++ )
host_array[ i ] = i;
/****
* Prepare array view for the device array - we will pass it to a CUDA kernel.
* NOTE: Better way is to use ArrayView::evaluate or ParallelFor, this is just
* an example.
*/
auto device_view = device_array.getView();
/****
* Call CUDA kernel to initiate the array on the device
*/
initKernel<<< 1, size >>>( device_view );
/****
* Print the results
*/
std::cout << " host_array = " << host_array << std::endl;
std::cout << " device_array = " << device_array << std::endl;
}

Output:

host_array = [ 0, 1, 2, 3, 4 ]
device_array = [ 0, -1, -2, -3, -4 ]

In general in TNL, each method defined as __cuda_callable__ can be called from the CUDA kernels. The method ArrayView::getSize is another example. We also would like to point the reader to better ways of arrays initiation for example with method ArrayView::forElements or with ParallelFor.

Accessing the array elements with setElement and getElement

On the other hand, the methods setElement and getElement can be called from the host no matter where the array is allocated. In addition they can be called from kernels on device where the array is allocated. getElement returns copy of an element rather than a reference. Therefore it is slightly slower. If the array is on GPU and the methods are called from the host, the array element is copied from the device on the host (or vice versa) which is significantly slower. In the parts of code where the performance matters, these methods shall not be called from the host when the array is allocated on the device. In this way, their use is, however, easier compared to operator[] and they allow to write one simple code for both CPU and GPU. Both methods are good candidates for:

  • reading/writing of only few elements in the array
  • arrays initiation which is done only once and it is not time critical part of a code
  • debugging purposes

The following example shows the use of getElement and setElement:

#include <iostream>
#include <TNL/Containers/Array.h>
#include <TNL/Containers/ArrayView.h>
using namespace TNL;
using namespace TNL::Containers;
int main( int argc, char* argv[] )
{
/****
* Create new arrays on both host and device
*/
const int size = 5;
Array< float, Devices::Host > host_array( size );
Array< float, Devices::Cuda > device_array( size );
/****
* Initiate the arrays with setElement
*/
for( int i = 0; i < size; i++ )
{
host_array.setElement( i, i );
device_array.setElement( i, i );
}
/****
* Compare the arrays using getElement
*/
for( int i = 0; i < size; i++ )
if( host_array.getElement( i ) == device_array.getElement( i ) )
std::cout << "Elements at position " << i << " match." << std::endl;
/****
* Print the results
*/
std::cout << "host_array = " << host_array << std::endl;
std::cout << "device_array = " << device_array << std::endl;
}

Output:

Elements at position 0 match.
Elements at position 1 match.
Elements at position 2 match.
Elements at position 3 match.
Elements at position 4 match.
host_array = [ 0, 1, 2, 3, 4 ]
device_array = [ 0, 1, 2, 3, 4 ]

Arrays and parallel for

More efficient and still quite simple method for (not only) array elements initiation is with the use of C++ lambda functions and methods forElements and forAllElements. As an argument a lambda function is passed which is then applied for all elements. Optionally one may define only subinterval of element indexes where the lambda shall be applied. If the underlying array is allocated on GPU, the lambda function is called from CUDA kernel. This is why it is more efficient than use of setElement. On the other hand, one must be careful to use only __cuda_callable__ methods inside the lambda. The use of the methods forElements and forAllElements is demonstrated in the following example.

#include <iostream>
#include <TNL/Containers/Array.h>
#include <TNL/Containers/ArrayView.h>
using namespace TNL;
template< typename Device >
void forElementsExample()
{
/****
* Create new arrays
*/
const int size = 10;
Containers::Array< float, Device > a( size ), b( size );
b = 0;
/****
* Initiate the elements of array `a`
*/
a.forAllElements( [] __cuda_callable__ ( int i, float& value ) { value = i; } );
/****
* Initiate elements of array `b` with indexes 0-4 using `a_view`
*/
auto a_view = a.getView();
b.forElements( 0, 5, [=] __cuda_callable__ ( int i, float& value ) { value = a_view[ i ] + 4.0; } );
/****
* Print the results
*/
std::cout << " a = " << a << std::endl;
std::cout << " b = " << b << std::endl;
}
int main( int argc, char* argv[] )
{
std::cout << "Running example on the host system: " << std::endl;
forElementsExample< Devices::Host >();
#ifdef HAVE_CUDA
std::cout << "Running example on the CUDA device: " << std::endl;
forElementsExample< Devices::Cuda >();
#endif
}

Output:

Running example on the host system:
a = [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 ]
b = [ 4, 5, 6, 7, 8, 0, 0, 0, 0, 0 ]
Running example on the CUDA device:
a = [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 ]
b = [ 4, 5, 6, 7, 8, 0, 0, 0, 0, 0 ]

Arrays and flexible reduction

Arrays also offer simpler way to do the flexible parallel reduction. See the section about the flexible parallel reduction to understand how it works. Flexible reduction for arrays just simplifies access to the array elements. See the following example:

Output:

Checking the array contents

The functions TNL::Algorithms::contains and TNL::Algorithms::containsOnlyValue serve for testing the contents of arrays, vectors or their views. contains returns true if there is at least one element in the array with given value. containsOnlyValue returns true only if all elements of the array are equal to the given value. The test can be restricted to a subinterval of array elements. See the following code snippet for usage example.

#include <iostream>
#include <TNL/Containers/Array.h>
#include <TNL/Algorithms/contains.h>
using namespace TNL;
using namespace TNL::Containers;
using namespace TNL::Algorithms;
int main( int argc, char* argv[] )
{
/****
* Create new arrays and initiate them
*/
const int size = 10;
Array< float, Devices::Cuda > a( size ), b( size );
a = 0;
b.forAllElements( [=] __cuda_callable__ ( int i, float& value ) { value = i; } );
/****
* Test the values stored in the arrays
*/
if( contains( a, 0.0 ) )
std::cout << "a contains 0" << std::endl;
if( contains( a, 1.0 ) )
std::cout << "a contains 1" << std::endl;
if( contains( b, 0.0 ) )
std::cout << "b contains 0" << std::endl;
if( contains( b, 1.0 ) )
std::cout << "b contains 1" << std::endl;
if( containsOnlyValue( a, 0.0 ) )
std::cout << "a contains only 0" << std::endl;
if( containsOnlyValue( a, 1.0 ) )
std::cout << "a contains only 1" << std::endl;
if( containsOnlyValue( b, 0.0 ) )
std::cout << "b contains only 0" << std::endl;
if( containsOnlyValue( b, 1.0 ) )
std::cout << "b contains only 1" << std::endl;
/****
* Change the first half of b and test it again
*/
b.forElements( 0, 5, [=] __cuda_callable__ ( int i, float& value ) { value = 0.0; } );
if( containsOnlyValue( b, 0.0, 0, 5 ) )
std::cout << "First five elements of b contains only 0" << std::endl;
}
Namespace for fundamental TNL algorithms.
Definition: AtomicOperations.h:23
bool containsOnlyValue(const Array &array, typename Array::ValueType value, typename Array::IndexType begin=0, typename Array::IndexType end=0)
Checks if all elements of an array/vector/view have the given value.
Definition: contains.h:63
bool contains(const Array &array, typename Array::ValueType value, typename Array::IndexType begin=0, typename Array::IndexType end=0)
Checks if an array/vector/view contains an element with given value.
Definition: contains.h:35

Output:

a contains 0
b contains 0
b contains 1
a contains only 0
First five elements of b contains only 0

IO operations with arrays

Methods save and load serve for storing/restoring the array to/from a file in a binary form. In case of Array, loading of an array from a file causes data reallocation. ArrayView cannot do reallocation, therefore the data loaded from a file is copied to the memory managed by the ArrayView. The number of elements managed by the array view and those loaded from the file must be equal. See the following example.

#include <iostream>
#include <TNL/Containers/Array.h>
#include <TNL/Containers/ArrayView.h>
using namespace TNL;
using namespace TNL::Containers;
int main( int argc, char* argv[] )
{
/****
* Create new arrays
*/
const int size = 15;
Array< float > a( size ), b( 5 );
a = 1;
b = 2;
/****
* Create array view for the middle of array a
*/
auto a_view = a.getView( 5, 10 );
/****
* Save array b to file
*/
b.save( "b.tnl" );
/****
* Load data from b to a_view
*/
a_view.load( "b.tnl" );
/****
* Print the results
*/
std::cout << "a = " << a << std::endl;
std::cout << "a_view = " << a_view << std::endl;
std::cout << "b = " << b << std::endl;
}

Output:

a = [ 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1 ]
a_view = [ 2, 2, 2, 2, 2 ]
b = [ 2, 2, 2, 2, 2 ]

Static arrays

Static arrays are allocated on stack and thus they can be created even in CUDA kernels. Their size is fixed and it is given by a template parameter. Static array is a templated class defined in namespace TNL::Containers having two template parameters:

  • Size is the array size.
  • Value is type of data stored in the array.

The interface of StaticArray is very smillar to Array but much simpler. It contains set of common constructors. Array elements can be accessed by the operator[] and also using method x(), y() and z() when it makes sense. See the following example for typical use of StaticArray.

#include <iostream>
#include <TNL/Containers/StaticArray.h>
#include <TNL/File.h>
using namespace TNL;
using namespace TNL::Containers;
int main( int argc, char* argv[] )
{
StaticArray< 3, int > a1, a2( 1, 2, 3 ), a3{ 4,3,2 };
a1 = 0.0;
std::cout << "a1 = " << a1 << std::endl;
std::cout << "a2 = " << a2 << std::endl;
std::cout << "a3 = " << a3 << std::endl;
File( "static-array-example-file.tnl", std::ios::out ) << a3;
File( "static-array-example-file.tnl", std::ios::in ) >> a1;
std::cout << "a1 = " << a1 << std::endl;
a1.sort();
std::cout << "Sorted a1 = " << a1 << std::endl;
}

The output looks as:

a1 = [ 0, 0, 0 ]
a2 = [ 1, 2, 3 ]
a3 = [ 4, 3, 2 ]
a1 = [ 4, 3, 2 ]
Sorted a1 = [ 2, 3, 4 ]

Distributed arrays

TODO