Commit 8fb93080 authored by Jakub Klinkovský's avatar Jakub Klinkovský

moved section on multidimensional arrays into a separate file

parent ebec4c9d
+1 −110
@@ -44,116 +44,7 @@ In this chapter, we present several data structures which are flexible in the se

\section{Multidimensional arrays}
\label{sec:ndarray}
\inline{I would omit the details; they are described in the master's thesis and I did not do much with this during the PhD... (only the distributed version -- which, however, is very useful for LBM, together with the synchronizer)}
\inline{Comparison with other libraries}

Many algorithms in scientific computing work with coefficients indexed by three, four, or even more indices, and multidimensional arrays are a natural data structure for representing such values in computer memory.
Since the \C++ language natively supports only one-dimensional arrays, multidimensional arrays have to be implemented explicitly (e.g., in a library) based on a mapping of multidimensional data onto an internal one-dimensional array.

An interesting problem is how to choose the mapping from the multidimensional index space into the one-dimensional index space.
Even for two-dimensional arrays (i.e., matrices) it can be argued whether they should be stored in the \emph{row-major} format, where rows are stored as 1D arrays, or in the \emph{column-major} format, where columns are stored as 1D arrays.
The optimal choice depends on the operations that we want to do with the data, as well as on the hardware architecture that will process the data.
For example, the row-major format is suitable for algorithms processing a matrix row by row on the CPU, whereas for GPU algorithms that also process a matrix row by row, the column-major format is more appropriate.
For three- and higher-dimensional arrays, there are even more possible array orderings.

For these reasons, we developed a data structure which allows configuring the indexing of multidimensional data and thus optimizing the data structure for a given algorithm and hardware architecture.
The data structure was first proposed in \cite{klinkovsky:2017thesis} and its implementation based on the \CPPfourteen standard was later integrated into the TNL library \cite{oberhuber:2021tnl} and further developed therein.
The TNL project's documentation (\url{https://tnl-project.org/documentation/})\todo{cite as Online Resource?} provides an overview of the public interface and examples showing how the data structure can be used.
In the following subsections, we describe the \emph{distributed} version of the data structure, which is especially useful for the algorithms presented in \cref{chapter:numerical methods}.

\subsection{Distributed multidimensional array}
In the \emph{distributed} configuration, a global multidimensional array is decomposed into several subarrays and each MPI rank typically stores the data associated to one subarray.
Since a multidimensional array stores structured data, we will consider only \emph{structured conforming} decompositions, where the array is split by hyperplanes perpendicular to one of the axes.
\Cref{fig:ndarray decomposition} shows a typical two-dimensional decomposition of a two-dimensional array into 9 subarrays.

\begin{figure}[bt]
    \centering
    \includegraphics[scale=1.0]{./figures/ndarray_decomposition/2D_without_overlaps.pdf}
    \caption{Typical two-dimensional decomposition of a two-dimensional array into 9 subarrays assigned to ranks 0--8.}
    \label{fig:ndarray decomposition}
\end{figure}

In order to represent a distributed multidimensional array, each MPI rank needs to have the following variables:
\begin{itemize}
    \item \ic{localArray} -- an instance of a multidimensional array which contains data and indexing attributes of the subarray assigned to the rank.
    \item \ic{communicator} -- the MPI communicator comprising all ranks that have a subarray of the global multidimensional array.
    \item \ic{globalSizes} -- a multi-index determining the sizes of the global multidimensional array in terms of the number of elements in each dimension.
    \item \ic{localBegins} and \ic{localEnds} -- two multi-indices determining the beginning and ending position of the local subarray in the global array.
        According to the convention used in TNL, the local subarray spans the multidimensional interval $[\ic{localBegins}, \ic{localEnds})$.
\end{itemize}
Although these attributes of a distributed multidimensional array can be created and managed manually, TNL provides the \ic{TNL::Containers::DistributedNDArray} class, which offers a high-level interface for managing them.
The TNL project's documentation (\url{https://tnl-project.org/documentation/})\todo{cite as Online Resource?} provides an overview of the public interface and examples showing how the data structure can be used.\todo{write the documentation for \ic{DistributedNDArray}}

\subsection{Operations}
Algorithms involving a distributed multidimensional array may perform different operations on the data structure.
Common operations may be classified as follows:
\begin{itemize}
    \item
        \emph{Local operations} are the simplest type of operations that can be performed on each MPI rank independently of the other ranks.
        An example is the \ic{setValue} member function of \ic{DistributedNDArray} which sets all elements of the array to a constant value.
    \item
        \emph{Collective operations} involve some kind of communication among all ranks in the MPI communicator.
        An example is the \ic{operator==} function for the comparison between two instances of \ic{DistributedNDArray}, which involves a local operation (comparison of two local arrays) followed by a collective \ic{AND}-reduction of the local results among all ranks.
    \item
        \emph{Partially-collective operations} involve communication not on the global communicator, but among smaller groups of MPI ranks.
        It is often convenient to create MPI sub-communicators using the \ic{MPI_Comm_split} function \todo{describe in \cref{chapter:programming and architectures} and reference it here} to simplify the communication pattern definition in the program.
        For example, computing the sums of array elements per row involves communication only among the ranks containing the same rows.
        Another example is a stencil computation on a regular grid, which typically involves data exchange among neighboring ranks.
\end{itemize}
Additionally, complex algorithms may involve data exchange prior to performing the local operation (such as the input vector re-distribution in the distributed matrix--vector multiplication).
A general description of such algorithms is beyond the scope of this work.
In the following subsection, we focus on data exchange in stencil computations, which are common in numerical analysis.

\subsection{Overlaps and data synchronization}
Stencils in numerical analysis are geometric arrangements of nodes on a structured grid that express which data is accessed from a grid node of interest and used in the numerical computation.
The application of stencils on the data stored in a non-distributed multidimensional array is straightforward as all data is readily available in the memory of one rank.
However, in a distributed configuration, the multidimensional array needs to be extended with \emph{overlaps} to facilitate access to data owned by neighboring ranks, and \emph{data synchronization} to exchange data in the overlapping regions of the array when the stencil is applied iteratively.

\Cref{fig:ndarray decomposition overlaps} shows a typical two-dimensional decomposition of a two-dimensional array into 9 subarrays with overlaps highlighted using dotted lines.
The width of the overlaps in the TNL implementation is configurable separately for each dimension of the array, i.e., the overlaps can be described by a multi-index \ic{overlaps} whose components are non-negative and determine the number of data elements along the corresponding axis shared between neighboring ranks.
Each rank includes the overlapping regions in its local array, which is therefore allocated larger, i.e., it spans the multidimensional interval $[\ic{localBegins}-\ic{overlaps}, \ic{localEnds}+\ic{overlaps})$.
Note that negative indices may be used to access local data in the overlapping region below \ic{localBegins}; they are mapped properly to the one-dimensional storage index.
Also note that the allocation always includes the so-called \emph{periodic overlaps}, i.e., overlapping regions around the whole array.
This makes it easy to implement periodic boundary conditions in numerical schemes, as it ensures that the stencil can be applied to all local nodes.

\begin{figure}[bt]
    \centering
    \includegraphics[scale=1.0]{./figures/ndarray_decomposition/2D_with_overlaps.pdf}
    \caption{Typical two-dimensional decomposition of a two-dimensional array into 9 subarrays with overlaps between neighbors (highlighted using dotted lines).}
    \label{fig:ndarray decomposition overlaps}
\end{figure}

The data synchronization for the overlaps in distributed multidimensional arrays in TNL is implemented in the class \ic{TNL::Containers::DistributedNDArraySynchronizer}.
Before the algorithm can be started, each MPI rank must configure the synchronizer:
\begin{itemize}
    \item
        The rank IDs of the \emph{neighbors} relevant for the stencil must be set.
        In the example shown in \cref{fig:ndarray decomposition overlaps}, rank 4 would set ranks 1, 3, 5, and 7 as its neighbors for the five-point stencil, and ranks 0, 1, 2, 3, 5, 6, 7, and 8 for the nine-point stencil.
    \item
        The \emph{directions} in which the data can be transferred must be set depending on the stencil.
        In the simplest case, all values of the array can be transferred in all directions, but some applications (such as the lattice Boltzmann method described in \cref{sec:LBM}) may use separate arrays for different synchronization directions (e.g., left-to-right or right-to-left).
    \item
        The \emph{tags} for MPI messages can be set to avoid conflicts when multiple arrays distributed among the same ranks are synchronized at the same time.
\end{itemize}
The synchronization algorithm can be executed in several modes.
The asynchronous or non-blocking mode allows interleaving the synchronization with other, unrelated work by suspending the execution once all non-blocking MPI requests have been created and deferring the remaining steps until later.
Assuming that the MPI implementation can proceed with the communication in the background, this approach can greatly improve the efficiency of the distributed algorithm.
The synchronization procedure consists of the following general steps:
\begin{enumerate}
    \item Allocate all send and receive buffers.
    \item Copy data from the local array to the send buffers.
    \item Start MPI communications with all neighbors via \ic{MPI_Isend} and \ic{MPI_Irecv}.
    \item Return a vector of MPI requests (the non-blocking procedure is suspended after this step).
    \item Wait for all MPI requests to complete.
    \item Copy data from the receive buffers to the local array.
\end{enumerate}
Some of the aforementioned steps are not always necessary.
For example, the allocated buffers can be reused when the synchronizer is used repeatedly on arrays of the same size.

In general, to limit the number of MPI requests, the data to be sent must be copied into a contiguous buffer, and the received data must be copied from a contiguous buffer into the local array.
In some special cases depending on the synchronization directions and the layout of the array, it can be ensured that all data to be sent or received is already stored in a contiguous block of the array, allowing the copies to be omitted by replacing the buffers with views into the local array itself.
These cases are difficult to detect automatically and must be configured manually by the user.
\input{content/ndarrays.tex}

\section{Sparse matrices}
\label{sec:sparse matrices}
