Documentation: enable automatic table of contents (4ce66fc2) · Commits · TNL / tnl-dev

Documentation/Doxyfile

+2 −2

Original line number	Diff line number	Diff line
		@@ -312,7 +312,7 @@ MARKDOWN_SUPPORT = YES
		# Minimum value: 0, maximum value: 99, default value: 0.
		# This tag requires that the tag MARKDOWN_SUPPORT is set to YES.

		- TOC_INCLUDE_HEADINGS = 0
		+ TOC_INCLUDE_HEADINGS = 3

		# When enabled doxygen tries to link words that correspond to documented
		# classes, or namespaces to their corresponding documentation. Such a link can
		@@ -1211,7 +1211,7 @@ HTML_STYLESHEET =
		# list). For an example see the documentation.
		# This tag requires that the tag GENERATE_HTML is set to YES.

		- HTML_EXTRA_STYLESHEET =
		+ HTML_EXTRA_STYLESHEET = custom_style.css

		# The HTML_EXTRA_FILES tag can be used to specify one or more extra images or
		# other source files which should be copied to the HTML output directory. Note

Documentation/Tutorials/Arrays/tutorial_Arrays.md

+14 −25

		\page tutorial_Arrays Arrays tutorial

		- ## Table of Contents
		- - [Table of Contents](#table-of-contents)
		- - [Introduction](#introduction)
		- - [Arrays<a name="arrays"></a>](#arrays)
		- - [Array views<a name="array-views"></a>](#array-views)
		- - [Accessing the array elements<a name="accessing-the-array-elements"></a>](#accessing-the-array-elements)
		- - [Accessing the array elements with `operator[]`<a name="accessing-the-array-elements-with-operator"></a>](#accessing-the-array-elements-with-operator)
		- - [Accessing the array elements with `setElement` and `getElement`<a name="accessing-the-array-elements-with-setelement-and-getelement"></a>](#accessing-the-array-elements-with-setelement-and-getelement)
		- - [Arrays and parallel for<a name="arrays-initiation-with-lambdas"></a>](#arrays-and-parallel-for)
		- - [Arrays and flexible reduction<a name="arrays-initiation-with-lambdas"></a>](#arrays-and-flexible-reduction)
		- - [Checking the array contents<a name="checking-the-array-contents"></a>](#checking-the-array-contents)
		- - [IO operations with arrays<a name="io-operations-with-arrays"></a>](#io-operations-with-arrays)
		- - [Static arrays<a name="static-arrays"></a>](#static-arrays)
		- - [Distributed arrays<a name="distributed-arrays"></a>](#distributed-arrays)
		+ [TOC]

		## Introduction

		This tutorial introduces arrays in TNL. There are three types: common arrays with dynamic allocation, static arrays allocated on the stack, and distributed arrays with dynamic allocation. Arrays are one of the most important structures for memory management. Methods implemented in arrays are particularly useful for GPU programming. From this point of view, the reader will learn how to easily allocate memory on the GPU, how to transfer data between the GPU and the CPU, and how to initialize data allocated on the GPU. In addition, the resulting code is hardware platform independent, so it can be run on CPU and GPU without any changes.

		- ## Arrays<a name="arrays"></a>
		+ ## Arrays

		Array is templated class defined in namespace `TNL::Containers` having three template parameters:

		@@ -36,7 +23,7 @@ The result looks as follows:
		\include ArrayAllocation.out


		- ### Array views<a name="array-views"></a>
		+ ### Array views

		Arrays cannot share data with each other or data allocated elsewhere. This can be achieved with the `ArrayView` structure which has similar semantics to `Array`, but it does not handle allocation and deallocation of the data. Hence, array view cannot be resized, but it can be used to wrap data allocated elsewhere (e.g. using an `Array` or an operator `new`) and to partition large arrays into subarrays. The process of wrapping external data with a view is called _binding_.
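The binding idea can be illustrated without TNL. The following is only a minimal sketch in plain C++: `SimpleView`, `bind`, `subView` and `bindingDemo` are hypothetical names chosen for this illustration and are not TNL's `ArrayView` interface. The view stores just a pointer and a size, so copying it is a shallow copy and the owner of the memory stays responsible for deallocation.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, simplified stand-in for a non-owning array view.
// It is NOT TNL's ArrayView; it only illustrates the binding idea.
template <typename Value>
struct SimpleView
{
    Value* data = nullptr;
    std::size_t size = 0;

    // Bind the view to memory that is owned by someone else.
    void bind(Value* externalData, std::size_t externalSize)
    {
        data = externalData;
        size = externalSize;
    }

    // Return a view of the subrange [begin, end) of the same memory.
    SimpleView subView(std::size_t begin, std::size_t end) const
    {
        assert(begin <= end && end <= size);
        SimpleView v;
        v.bind(data + begin, end - begin);
        return v;
    }

    // Element access; the view never allocates or frees anything.
    Value& operator[](std::size_t i) const
    {
        assert(i < size);
        return data[i];
    }
};

// Writes through a sub-view and returns the value seen by the owner.
inline int bindingDemo()
{
    int* owned = new int[6]{0, 1, 2, 3, 4, 5};  // allocated elsewhere
    SimpleView<int> whole;
    whole.bind(owned, 6);
    SimpleView<int> tail = whole.subView(3, 6);  // elements 3, 4, 5
    tail[0] = 30;                                // modifies owned[3]
    const int seen = owned[3];
    delete[] owned;                              // the owner frees the data
    return seen;
}
```

Because the sketch copies only the pointer and the size, a write through `tail` is visible through `owned`, which is the behaviour the paragraph above describes for views.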

		@@ -58,11 +45,11 @@ Output:

		Since array views do not allocate or deallocate memory, they can be created even in CUDA kernels, which is not possible with `Array`. `ArrayView` can also be passed-by-value into CUDA kernels or captured-by-value by device lambda functions, because the `ArrayView`'s copy-constructor makes only a shallow copy (i.e., it copies only the data pointer and size).

		- ### Accessing the array elements<a name="accessing-the-array-elements"></a>
		+ ### Accessing the array elements

		There are two ways to work with the array (or array view) elements: using the indexing operator (`operator[]`), which is more efficient, or using the methods `setElement` and `getElement`, which are more flexible.

		- #### Accessing the array elements with `operator[]`<a name="accessing-the-array-elements-with-operator"></a>
		+ #### Accessing the array elements with `operator[]`

		Indexing operator `operator[]` is implemented in both `Array` and `ArrayView` and it is defined as `__cuda_callable__`. It means that it can be called even in CUDA kernels if the data is allocated on GPU, i.e. the `Device` parameter is `Devices::Cuda`. This operator returns a reference to given array element and so it is very efficient. However, calling this operator from host for data allocated on device (or vice versa) leads to segmentation fault (on the host system) or broken state of the device. It means:

		@@ -79,7 +66,7 @@ Output:

		In general in TNL, each method defined as `__cuda_callable__` can be called from CUDA kernels. The method `ArrayView::getSize` is another example. We would also like to point the reader to better ways of initializing arrays, for example with the method `ArrayView::forElements` or with `ParallelFor`.

		- #### Accessing the array elements with `setElement` and `getElement`<a name="accessing-the-array-elements-with-setelement-and-getelement"></a>
		+ #### Accessing the array elements with `setElement` and `getElement`

		On the other hand, the methods `setElement` and `getElement` can be called from the host no matter where the array is allocated. In addition, they can be called from kernels on the device where the array is allocated. `getElement` returns a copy of an element rather than a reference, so it is slightly slower. If the array is on the GPU and the methods are called from the host, the array element is copied from the device to the host (or vice versa), which is significantly slower. In the parts of code where performance matters, these methods should not be called from the host when the array is allocated on the device. Their use is, however, easier compared to `operator[]`, and they allow writing one simple code for both CPU and GPU. Both methods are good candidates for:

		@@ -95,7 +82,7 @@ Output:

		\include ElementsAccessing-2.out

		- ### Arrays and parallel for<a name="arrays-initiation-with-lambdas"></a>
		+ ### Arrays and parallel for

		A more efficient and still quite simple way of initializing array elements (among other things) is to use C++ lambda functions with the methods `forElements` and `forEachElement`. A lambda function is passed as an argument and is then applied to all elements. Optionally, one may define only a subinterval of element indexes where the lambda shall be applied. If the underlying array is allocated on the GPU, the lambda function is called from a CUDA kernel. This is why it is more efficient than the use of `setElement`. On the other hand, one must be careful to use only `__cuda_callable__` methods inside the lambda. The use of the methods `forElements` and `forEachElement` is demonstrated in the following example.

		@@ -105,7 +92,7 @@ Output:

		\include ArrayExample_forElements.out

		- ### Arrays and flexible reduction<a name="arrays-initiation-with-lambdas"></a>
		+ ### Arrays and flexible reduction

		Arrays also offer a simpler way to do the flexible parallel reduction. See the section about [the flexible parallel reduction](tutorial_ReductionAndScan.html#flexible_parallel_reduction) to understand how it works. Flexible reduction for arrays just simplifies access to the array elements. See the following example:

		@@ -116,7 +103,7 @@ Output:
		\include ArrayExample_reduceElements.out


		- ### Checking the array contents<a name="checking-the-array-contents"></a>
		+ ### Checking the array contents

		Methods `containsValue` and `containsOnlyValue` serve for testing the contents of the arrays. `containsValue` returns `true` if there is at least one element in the array with the given value. `containsOnlyValue` returns `true` only if all elements of the array equal the given value. The test can be restricted to a subinterval of array elements. Both methods are implemented in `Array` as well as in `ArrayView`. See the following code snippet for an example of use.
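For readers who know the standard library better than TNL, the semantics described above correspond roughly to `std::any_of` and `std::all_of` on a subrange. The sketch below uses `std::vector` on the host only; `containsValueSketch` and `containsOnlyValueSketch` are hypothetical helper names and their signatures are an assumption made for this illustration, not the TNL methods themselves.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// True if at least one element in [begin, end) equals value.
template <typename T>
bool containsValueSketch(const std::vector<T>& a, const T& value,
                         std::size_t begin, std::size_t end)
{
    assert(begin <= end && end <= a.size());
    return std::any_of(a.begin() + begin, a.begin() + end,
                       [&](const T& x) { return x == value; });
}

// True only if every element in [begin, end) equals value.
template <typename T>
bool containsOnlyValueSketch(const std::vector<T>& a, const T& value,
                             std::size_t begin, std::size_t end)
{
    assert(begin <= end && end <= a.size());
    return std::all_of(a.begin() + begin, a.begin() + end,
                       [&](const T& x) { return x == value; });
}
```

How the TNL methods treat an empty subinterval is not stated in this tutorial; the sketch simply inherits the standard-library behaviour for empty ranges.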

		@@ -126,7 +113,7 @@ Output:

		\include ContainsValue.out

		- ### IO operations with arrays<a name="io-operations-with-arrays"></a>
		+ ### IO operations with arrays

		Methods `save` and `load` serve for storing/restoring the array to/from a file in a binary form. In case of `Array`, loading of an array from a file causes data reallocation. `ArrayView` cannot do reallocation, therefore the data loaded from a file is copied to the memory managed by the `ArrayView`. The number of elements managed by the array view and those loaded from the file must be equal. See the following example.

		@@ -136,7 +123,7 @@ Output:

		\include ArrayIO.out

		- ## Static arrays<a name="static-arrays"></a>
		+ ## Static arrays

		Static arrays are allocated on stack and thus they can be created even in CUDA kernels. Their size is fixed and it is given by a template parameter. Static array is a templated class defined in namespace `TNL::Containers` having two template parameters:

		@@ -151,4 +138,6 @@ The output looks as:

		\include StaticArrayExample.out

		- ## Distributed arrays<a name="distributed-arrays"></a>
		+ ## Distributed arrays
		+
		+ TODO

Documentation/Tutorials/ForLoops/tutorial_ForLoops.md

+8 −12

		\page tutorial_ForLoops For loops

		+ [TOC]
		+
		## Introduction

		This tutorial shows how to use different kind of for loops implemented in TNL. Namely, they are:
		@@ -9,13 +11,7 @@ This tutorial shows how to use different kind of for loops implemented in TNL. N
		* Static For is a for loop which is performed sequentially and is explicitly unrolled by C++ templates. The number of iterations must be static (known at compile time).
		* Templated Static For ....

		- ## Table of Contents
		- 1. [Parallel For](#parallel_for)
		- 2. [n-dimensional Parallel For](#n_dimensional_parallel_for)
		- 3. [Static For](#static_for)
		- 4. [Templated Static For](#templated_static_for)
		-
		- ## Parallel For<a name="parallel_for"></a>
		+ ## Parallel For

		Basic parallel for construction in TNL serves for hardware platform transparent expression of parallel for loops. The hardware platform is expressed by a template parameter. The parallel for is defined as:

		@@ -31,7 +27,7 @@ The result is:

		\include ParallelForExample.out

		- ## n-dimensional Parallel For<a name="n_dimensional_parallel_for"></a>
		+ ## n-dimensional Parallel For

		Performing for-loops in higher dimensions is similar. In the following example we build a 2D mesh function on top of a TNL vector. Two-dimensional indexes `( i, j )` are mapped to the vector index `idx` as `idx = j * xSize + i`, where the mesh function has dimensions `xSize * ySize`. Of course, in this simple example, it does not make any sense to compute a sum of two mesh functions this way; it is only an example.
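The index mapping itself is independent of TNL and can be checked on the host. The following is a sequential sketch with `std::vector`, assuming nothing about the TNL vector or parallel-for interfaces; `linearIndex` and `addMeshFunctions` are hypothetical names introduced only here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Map the 2D index (i, j) to a linear index; i runs fastest.
inline std::size_t linearIndex(std::size_t i, std::size_t j, std::size_t xSize)
{
    assert(i < xSize);
    return j * xSize + i;
}

// Sum two mesh functions stored as flat vectors of size xSize * ySize.
// The two nested loops are what a 2D parallel for would distribute.
inline std::vector<double> addMeshFunctions(const std::vector<double>& u,
                                            const std::vector<double>& v,
                                            std::size_t xSize,
                                            std::size_t ySize)
{
    assert(u.size() == xSize * ySize && v.size() == u.size());
    std::vector<double> result(u.size());
    for (std::size_t j = 0; j < ySize; j++)
        for (std::size_t i = 0; i < xSize; i++) {
            const std::size_t idx = linearIndex(i, j, xSize);
            result[idx] = u[idx] + v[idx];
        }
    return result;
}
```

With `xSize = 4`, the index `( 2, 1 )` maps to `1 * 4 + 2 = 6`, so consecutive values of `i` are adjacent in memory.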

		@@ -47,7 +43,7 @@ For the completeness, we show modification of the previous example into 3D:

		\include ParallelForExample-3D_ug.cpp

		- ## Static For<a name="static_for"></a>
		+ ## Static For

		The static for-loop is designed for short loops with a constant (i.e. known at compile time) number of iterations. It is often used with static arrays and vectors. An advantage of this kind of for-loop is that it is explicitly unrolled when the loop is short (up to eight iterations). See the following example:
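To show what unrolling by C++ templates can mean in general, here is a minimal sketch based on recursive template instantiation. It is not TNL's `StaticFor` and makes no claim about how TNL implements it; `StaticLoopSketch` and `dotProduct3` are hypothetical names, and whether the generated code is actually free of loops and calls depends on the compiler's inlining.

```cpp
#include <cstddef>

// Hypothetical compile-time loop over [Begin, End); not TNL's StaticFor.
// Each instantiation calls the body once and recurses with Begin + 1.
template <int Begin, int End>
struct StaticLoopSketch
{
    template <typename Body>
    static void exec(Body&& body)
    {
        body(Begin);
        StaticLoopSketch<Begin + 1, End>::exec(body);
    }
};

// Terminating case: the range is empty, so nothing is executed.
template <int End>
struct StaticLoopSketch<End, End>
{
    template <typename Body>
    static void exec(Body&&) {}
};

// Example use: dot product of two fixed-size arrays.
inline int dotProduct3(const int (&a)[3], const int (&b)[3])
{
    int sum = 0;
    StaticLoopSketch<0, 3>::exec([&](int i) { sum += a[i] * b[i]; });
    return sum;
}
```

The number of iterations is a template parameter here, which is why it must be known at compile time.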

		@@ -69,7 +65,7 @@ The benefit of `StaticFor` is mainly in the explicit unrolling of short loops wh

		`StaticFor` can be used also in CUDA kernels.

		- ## Templated Static For<a name="templated_static_for"></a>
		+ ## Templated Static For

		Templated static for-loop (`TemplateStaticFor`) is a for-loop in template parameters. For example, if class `LoopBody` is defined as

Documentation/Tutorials/GeneralConcepts/tutorial_GeneralConcepts.md

+6 −13

		\page tutorial_GeneralConcepts General concepts

		- ## Table of Contents
		- - [Table of Contents](#table-of-contents)
		- - [Introduction](#introduction)
		- - [Devices and allocators<a name="devices-and-allocators"></a>](#devices-and-allocators)
		- - [Algorithms and lambda functions<a name="algorithms-and-lambda-functions"></a>](#algorithms-and-lambda-functions)
		- - [Shared pointers and views<a name="shared-pointers-and-views"></a>](#shared-pointers-and-views)
		- - [Data structures views<a name="data-structures-views"></a>](#data-structures-views)
		- - [Shared pointers<a name="shared-pointers"></a>](#shared-pointers)
		+ [TOC]

		## Introduction

		In this part we describe some general and core concepts of programming with TNL. Understanding these ideas may significantly help to understand the design of TNL algorithms and data structures, and it also helps to use TNL more efficiently. The main goal of TNL is to allow developing high performance algorithms that can run on multicore CPUs and GPUs. TNL offers a unified interface, so the developer writes one code for both architectures.

		- ## Devices and allocators<a name="devices-and-allocators"></a>
		+ ## Devices and allocators

		TNL offers a unified interface for both CPUs (also referred to as the host system) and GPUs (referred to as the device). The connection between CPU and GPU is usually represented by the [PCI-Express bus](https://en.wikipedia.org/wiki/PCI_Express), which is orders of magnitude slower than the global memory of the GPU. Therefore, the communication between CPU and GPU must be reduced as much as possible. As a result, the programmer operates with two different address spaces, one for the CPU and one for the GPU. To distinguish between the address spaces, each data structure requiring dynamic allocation of memory needs to know on which device it resides. This is done by a template parameter `Device`. For example, the following code creates two arrays, one on the CPU and the other on the GPU

		@@ -37,7 +30,7 @@ If we need to specialize some parts of algorithm with respect to its device we c

		TODO: Allocators

		- ## Algorithms and lambda functions<a name="algorithms-and-lambda-functions"></a>
		+ ## Algorithms and lambda functions

		Developing a code for GPUs (in [CUDA](https://developer.nvidia.com/CUDA-zone) for example) consists mainly of writing [kernels](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#kernels) which are special functions running on GPU in parallel. This can be very hard and tedious work especially when it comes to debugging. [Parallel reduction](https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf) is a perfect example of an algorithm which is relatively hard to understand and implement on one hand but it is necessary to use frequently. Writing tens of lines of code every time we need to sum up some data is exactly what we mean by tedious programming. TNL offers skeletons or patterns of such algorithms and combines them with user defined [lambda functions](https://en.cppreference.com/w/cpp/language/lambda). This approach is not absolutely general, which means that you can use it only in situation when there is a skeleton/pattern (see \ref TNL::Algorithms) suitable for your problem. But when there is, it offers several advantages:

		@@ -75,7 +68,7 @@ This could be achieved with the following code:

		We believe that C++ lambda functions with properly designed patterns of parallel algorithms could make programming of GPUs significantly easier. We see a parallel with the [MPI standard](https://en.wikipedia.org/wiki/Message_Passing_Interface), which in the nineties defined frequent communication operations in distributed parallel computing. It made programming of distributed systems much easier, and at the same time MPI helps to write efficient programs. We aim to add additional skeletons or patterns to \ref TNL::Algorithms.
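The skeleton-plus-lambda idea from this section can be sketched in plain C++. The code below is a sequential, host-only illustration under stated assumptions: `reduceSketch` and `sumOfSquares` are hypothetical names, and the parameter order is chosen for this example rather than taken from \ref TNL::Algorithms. The skeleton owns the loop, while the user supplies only a lambda that fetches the i-th value and a lambda that combines two values.

```cpp
#include <cstddef>

// Hypothetical reduction skeleton. A parallel implementation would
// distribute the index range; this sketch just loops sequentially.
template <typename Result, typename Fetch, typename Reduction>
Result reduceSketch(std::size_t begin, std::size_t end,
                    Fetch fetch, Reduction reduction, Result identity)
{
    Result result = identity;
    for (std::size_t i = begin; i < end; i++)
        result = reduction(result, fetch(i));
    return result;
}

// Example use: the sum of squares needs two short lambdas and no loop.
inline double sumOfSquares(const double* data, std::size_t size)
{
    return reduceSketch(
        0, size,
        [=](std::size_t i) { return data[i] * data[i]; },  // fetch
        [](double a, double b) { return a + b; },          // reduction
        0.0);                                              // identity
}
```

Changing the operation, for example to a maximum, only means replacing the two lambdas and the identity value; the skeleton stays untouched.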

		- ## Shared pointers and views<a name="shared-pointers-and-views"></a>
		+ ## Shared pointers and views

		You might notice that in the previous section we used only C style arrays represented by pointers in the lambda functions. There is a difficulty when we want to access TNL arrays or other data structures inside the lambda functions. We may capture the outside variables either by a value or a reference. The first case would be as follows:

		@@ -90,7 +83,7 @@ This would be correct on CPU (i.e. when `Device` is \ref TNL::Devices::Host ). H
		1. Data structures views
		2. Shared pointers

		- ### Data structures views<a name="data-structures-views"></a>
		+ ### Data structures views

		A view is a kind of lightweight reference object which makes only a shallow copy of itself in its copy constructor. Therefore a view can be captured by value, but because it is, in fact, a reference to another object, everything we do with the view will affect the original object. The example with the array would look as follows:

		@@ -110,6 +103,6 @@ Note, that changing the data managed by the array after fetching the view is not

		On the line 6, we change value of the first element. This causes no data reallocation or change of size and so the view fetched on the line 5 is still valid and up-to-date.

		- ### Shared pointers<a name="shared-pointers"></a>
		+ ### Shared pointers

		TNL offers smart pointers working across different devices (meaning CPU or GPU).

Documentation/Tutorials/Matrices/tutorial_Matrices.md

+27 −50

File changed.

Preview size limit exceeded, changes collapsed.