Commit 776af14a authored by Tomáš Oberhuber

Writing vector tutorial.
# Arrays tutorial
## Introduction
This tutorial introduces arrays in TNL. ```Array``` is one of the most important structures for memory management. Methods implemented in arrays are particularly useful for GPU programming. From this point of view, the reader will learn how to easily allocate memory on GPU, transfer data between GPU and CPU, but also how to initialise data allocated on GPU. In addition, the resulting code is hardware platform independent, so it can be run on CPU without any changes.
## Arrays
```Array``` is a templated class defined in the namespace ```TNL::Containers``` having three template parameters:
* ```Value``` is the type of data to be stored in the array
* ```Device``` is the device where the array is allocated. Currently it can be either ```Devices::Host``` for CPU or ```Devices::Cuda``` for GPU supporting CUDA.
* ```Index``` is the type to be used for indexing the array elements.
The following example shows how to allocate arrays on CPU and GPU and how to manipulate the data.
\include ArrayAllocation.cpp
The result looks as follows:
\include ArrayAllocation.out
## Arrays binding
Arrays can share data with each other or with data allocated elsewhere. This is called binding and it can be done using the method ```bind```. The following example shows how to bind data allocated on the host using the ```new``` operator. In this case, the TNL array does not free this data at the end of its life cycle.
\include ArrayBinding-1.cpp
It generates output like this:
\include ArrayBinding-1.out
One may also bind another TNL array. In this case, the data can be shared between multiple arrays. A reference counter ensures that the data is freed after the last array sharing the data ends its life cycle.
\include ArrayBinding-2.cpp
The result is:
\include ArrayBinding-2.out
Binding may also serve for data partitioning. Both CPU and GPU prefer data allocated in large contiguous blocks rather than in many fragmented pieces of allocated memory. Another reason why one might want to partition the allocated data is demonstrated in the following example. Consider a situation of solving incompressible flow in 2D. The degrees of freedom consist of density and two components of velocity. Mostly, we want to manipulate either the density or the velocity. But some numerical solvers may need to have all degrees of freedom in one array. It can be managed like this:
\include ArrayBinding-3.cpp
The result is:
\include ArrayBinding-3.out
## Array views
Because of the data sharing, TNL ```Array``` is a relatively complicated structure. In many situations, we prefer a lightweight structure which only encapsulates the data pointer and keeps information about the data size. Passing an array structure to a GPU kernel is one example. For this purpose there is ```ArrayView``` in TNL. It is a templated structure having the same template parameters as ```Array``` (that is, ```Value```, ```Device``` and ```Index```). In fact, it is recommended to use ```Array``` only for the data allocation and to use ```ArrayView``` for most of the operations with the data, since array views offer better functionality (for example, ```ArrayView``` can be captured by lambda functions in CUDA while ```Array``` cannot). The following code snippet shows how to create an array view.
\include ArrayView-1.cpp
Its output is:
\include ArrayView-1.out
Of course, one may also bind one's own data into an array view:
\include ArrayView-2.cpp
Output:
\include ArrayView-2.out
An array view never allocates or deallocates the memory it manages. Therefore it can be created even in CUDA kernels, which is not true for ```Array```.
## Accessing the array elements
There are two ways to work with the array (or array view) elements - using the indexing operator (```operator[]```), which is more efficient, or the methods ```setElement``` and ```getElement```, which are more flexible.
### Accessing the array elements with ```operator[]```
The indexing operator ```operator[]``` is implemented in both ```Array``` and ```ArrayView``` and it is defined as ```__cuda_callable__```. This means that it can be called even in CUDA kernels if the data is allocated on the GPU, i.e. the ```Device``` parameter is ```Devices::Cuda```. This operator returns a reference to the given array element and so it is very efficient. However, calling this operator from the host for data allocated on the device (or vice versa) leads to a segmentation fault (on the host system) or a broken state of the device. This means:
* You may call the ```operator[]``` on the **host** only for data allocated on the **host** (with device ```Devices::Host```).
* You may call the ```operator[]``` on the **device** only for data allocated on the **device** (with device ```Devices::Cuda```).
The following example shows the use of ```operator[]```.
\include ElementsAccessing-1.cpp
Output:
\include ElementsAccessing-1.out
In general in TNL, each method defined as ```__cuda_callable__``` can be called from CUDA kernels. The method ```ArrayView::getSize``` is another example. We would also like to point the reader to better ways of array initialization, for example with the method ```ArrayView::evaluate``` or with ```ParallelFor```.
### Accessing the array elements with ```setElement``` and ```getElement```
On the other hand, the methods ```setElement``` and ```getElement``` can be called **from the host only** no matter where the array is allocated. Neither of these methods can be used in CUDA kernels. ```getElement``` returns a copy of an element rather than a reference, therefore it is slightly slower. If the array is on the GPU, the array element is copied from the device to the host (or vice versa), which is significantly slower. In those parts of code where performance matters, these methods shall not be called. Their use is, however, much easier and they allow writing one simple code for both CPU and GPU. Both methods are good candidates for:
* reading/writing of only a few elements in the array
* array initialization which is done only once and is not a time-critical part of the code
* debugging purposes
The following example shows the use of ```getElement``` and ```setElement```:
\include ElementsAccessing-2.cpp
Output:
\include ElementsAccessing-2.out
## Array initialization with lambdas
A more efficient and still quite simple method of array initialization is the use of C++ lambda functions together with the method ```evaluate```. This method is implemented in ```ArrayView``` only. A lambda function is passed as an argument and it is then evaluated for all elements. Optionally, one may define only a subinterval of element indexes where the lambda shall be evaluated. If the underlying array is allocated on the GPU, the lambda function is called from a CUDA kernel. This is why it is more efficient than using ```setElement```. On the other hand, one must be careful to use only ```__cuda_callable__``` methods inside the lambda. The following example demonstrates the use of the method ```evaluate```.
\include ArrayViewEvaluate.cpp
Output:
\include ArrayViewEvaluate.out
## Checking the array contents
The methods ```containsValue``` and ```containsOnlyValue``` serve for testing the contents of arrays. ```containsValue``` returns ```true``` if there is at least one element in the array with the given value. ```containsOnlyValue``` returns ```true``` only if all elements of the array equal the given value. The test can be restricted to a subinterval of array elements. Both methods are implemented in ```Array``` as well as in ```ArrayView```. See the following code snippet for an example of use.
\include ContainsValue.cpp
Output:
\include ContainsValue.out
## IO operations with Arrays
The methods ```save``` and ```load``` serve for storing/restoring the array to/from a file in binary form. In the case of ```Array```, loading an array from a file causes data reallocation. ```ArrayView``` cannot reallocate, therefore the data loaded from a file is copied into the memory managed by the ```ArrayView```. The number of elements managed by the array view and the number of elements loaded from the file must be equal. See the following example.
\include ArrayIO.cpp
Output:
\include ArrayIO.out
#IF( BUILD_CUDA )
# CUDA_ADD_EXECUTABLE( ArrayAllocation ArrayAllocation.cu )
# ADD_CUSTOM_COMMAND( COMMAND ArrayAllocation > ArrayAllocation.out OUTPUT ArrayAllocation.out )
#ENDIF()
IF( BUILD_CUDA )
#ADD_EXECUTABLE( Expressions Expressions.cpp )
CUDA_ADD_EXECUTABLE( Expressions Expressions.cu )
ADD_CUSTOM_COMMAND( COMMAND Expressions > Expressions.out OUTPUT Expressions.out )
ENDIF()
#IF( BUILD_CUDA )
#ADD_CUSTOM_TARGET( TutorialsVectors-cuda ALL DEPENDS
# ArrayViewEvaluate.out )
#ENDIF()
IF( BUILD_CUDA )
ADD_CUSTOM_TARGET( TutorialsVectors-cuda ALL DEPENDS
Expressions.out )
ENDIF()
# set input and output files
set(DOXYGEN_IN ${CMAKE_CURRENT_SOURCE_DIR}/Doxyfile.in)
add_custom_target( doc_doxygen_tutorial_vectors ALL
COMMENT "Generating API documentation with Doxygen"
VERBATIM )
INSTALL( DIRECTORY ${CMAKE_CURRENT_BINARY_DIR}/html/ DESTINATION ${CMAKE_INSTALL_PREFIX}/share/doc/tnl/html/Tutorials/Arrays )
INSTALL( DIRECTORY ${CMAKE_CURRENT_BINARY_DIR}/html/ DESTINATION ${CMAKE_INSTALL_PREFIX}/share/doc/tnl/html/Tutorials/Vectors )
#include <iostream>
#include <TNL/Containers/Vector.h>
#include <TNL/Containers/VectorView.h>
using namespace TNL;
using namespace TNL::Containers;
template< typename Device >
void expressions()
{
using VectorType = Vector< float, Device >;
using ViewType = VectorView< float, Device >;
/****
* Create vectors
*/
const int size = 6;
VectorType a_v( size ), b_v( size ), c_v( size );
ViewType a = a_v.getView();
ViewType b = b_v.getView();
ViewType c = c_v.getView();
a.evaluate( [] __cuda_callable__ ( int i )->float { return i - 3;} );
b = abs( a );
c = sign( b );
std::cout << "a = " << a << std::endl;
std::cout << "b = " << b << std::endl;
std::cout << "c = " << c << std::endl;
std::cout << "a + 3 * b + c * min( c, 0 ) = " << a + 3 * b + c * min( c, 0 ) << std::endl;
}
int main( int argc, char* argv[] )
{
/****
* Perform test on CPU
*/
std::cout << "Expressions on CPU ..." << std::endl;
expressions< Devices::Host >();
/****
* Perform test on GPU
*/
std::cout << "Expressions on GPU ..." << std::endl;
expressions< Devices::Cuda >();
}
Expressions.cpp
## Introduction
This tutorial introduces vectors in TNL. `Vector`, in addition to `Array`, also offers basic operations from linear algebra. Methods implemented in arrays and vectors are particularly useful for GPU programming. From this point of view, the reader will learn how to easily allocate memory on GPU, transfer data between GPU and CPU, but also how to initialise data allocated on GPU and perform parallel reduction and vector operations without writing low-level CUDA kernels. In addition, the resulting code is hardware platform independent, so it can be run on CPU without any changes.
# Table of Contents
1. [Vectors](#vectors)
2. [Static vectors](#static_vectors)
## Vectors <a name="vectors"></a>
`Vector` is, similarly to `Array`, a templated class defined in the namespace `TNL::Containers` having three template parameters:
* `Real` is the type of data to be stored in the vector
* `Device` is the device where the vector is allocated. Currently it can be either `Devices::Host` for CPU or `Devices::Cuda` for GPU supporting CUDA.
* `Index` is the type to be used for indexing the vector elements.
`Vector`, unlike `Array`, requires that the `Real` type is numeric or a type for which basic algebraic operations are defined. What kind of algebraic operations is required depends on what vector operations the user will call. `Vector` is derived from `Array`, so it inherits all its methods. In the same way as `Array` has its counterpart `ArrayView`, `Vector` has `VectorView`, which is derived from `ArrayView`. We refer to the [Arrays tutorial](../Arrays/index.html) for more details.
### Vector expressions
Vector expressions in TNL are processed by [Expression Templates](https://en.wikipedia.org/wiki/Expression_templates). This makes algebraic operations with vectors easy to write and very efficient at the same time. In some cases, one gets even more efficient code compared to [BLAS](https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms) and [cuBLAS](https://developer.nvidia.com/cublas). See the following example to learn how simple it is.
\include Expressions.cpp
Output is:
\include Expressions.out
## Static vectors <a name="static_vectors"></a>