Commit 55fa42fb authored by Jakub Klinkovský's avatar Jakub Klinkovský

doc: add job submission

parent 54677f46
+1 −0
@@ -21,3 +21,4 @@ You can also follow the documentation below to find more details.
## Documentation

- [Hardware overview](./doc/hardware-overview.md)
- [Job submission](./doc/jobs.md)

doc/jobs.md

0 → 100644
+222 −0
# Job submission

The cluster uses [Slurm](https://slurm.schedmd.com/) as its job scheduler and workload manager.
The upstream documentation has an excellent [Quick Start User Guide](https://slurm.schedmd.com/quickstart.html).

A very basic overview and examples of scripts configured directly for the GPU cluster are provided below for convenience.

_Table of contents_:

[[_TOC_]]

## Basic commands

- `sinfo` reports the state of partitions and nodes managed by Slurm.
- `squeue` reports the state of running and pending jobs or job steps.
- `sbatch` is used to submit a job script for later execution.
- `srun` is used to submit a job for execution or initiate job steps in real time.
- `scancel` is used to cancel a pending or running job or job step. It can also be used to send an arbitrary signal to all processes associated with a running job or job step.
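
A typical command-line workflow chains these together. A sketch (the script name `example.sh` and the job id shown are placeholders; `sbatch --parsable` prints only the job id, which is convenient in scripts):

```shell
# submit a job script and capture the job id
# (on the cluster: jobid=$(sbatch --parsable example.sh);
#  here we parse the default "Submitted batch job <jobid>" line as a stand-in)
jobid=$(echo "Submitted batch job 12345" | awk '{print $4}')

# show only your own jobs in the queue
# squeue --user="$USER"

# cancel the job if it is no longer needed
# scancel "$jobid"

echo "$jobid"
```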

## Submitting a job

Batch jobs can be submitted by creating a shell script containing `#SBATCH` directives specifying various options for Slurm.
Not many options are needed for a basic serial job:

```bash
#!/bin/bash

# job name (default is the name of this file)
#SBATCH --job-name=example
# file name for stdout/stderr (%x will be replaced with the job name, %j with the jobid)
#SBATCH --output=log.%x.job_%j
# maximum wall time allocated for the job (D-H:MM:SS)
#SBATCH --time=0:01:00

#SBATCH --partition=gpXY        # partition/queue name for the job submission
#SBATCH --ntasks=1              # number of tasks/processes

# start the job in the directory it was submitted from
cd "$SLURM_SUBMIT_DIR"

# run the computation
echo "hello world"
```

If you saved the script as `example.sh`, you can submit it with `sbatch example.sh`.
When the job has been started on a compute node, the log file `log.example.job_<jobid>` will be created based on the `--output` option, where you can see incremental output of the computation.

The complete list of the available options/directives can be found in the [sbatch(1)](https://slurm.schedmd.com/sbatch.html) manual page.
The following sections provide more advanced (and useful) examples.

### Email notification

You can add the `--mail-type` and `--mail-user` options for `sbatch` to request email notifications about the events related to your job.

```bash
#SBATCH --mail-type=ALL
#SBATCH --mail-user=user@example.com
```

The following event types are supported:

| Event type     | Description |
|----------------|-------------|
| BEGIN          | Job started |
| END            | Job finished |
| FAIL           | Job failed |
| REQUEUE        | Job was requeued |
| INVALID_DEPEND | Dependency never satisfied |
| STAGE_OUT      | Burst buffer stage out and teardown completed |
| ALL            | Equivalent to BEGIN, END, FAIL, INVALID_DEPEND, REQUEUE, and STAGE_OUT |
| TIME_LIMIT     | Job reached its time limit |
| TIME_LIMIT_90  | Reached 90 percent of time limit |
| TIME_LIMIT_80  | Reached 80 percent of time limit |
| TIME_LIMIT_50  | Reached 50 percent of time limit |
| ARRAY_TASKS    | Send emails for each array task |

Multiple event types may be specified as a comma-separated list.
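
For example, to be notified only when the job finishes, fails, or reaches 90 percent of its time limit:

```bash
#SBATCH --mail-type=END,FAIL,TIME_LIMIT_90
#SBATCH --mail-user=user@example.com
```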

## Allocating resources (CPUs, GPUs, RAM)

When a job is added to the queue, it waits until the requested _consumable resources_ are available and can be allocated to the job.
The consumable resources configured on the GPU cluster are CPU cores, memory (RAM) and GPUs.
Note that hyper-threading is enabled on compute nodes, so each physical CPU core has 2 virtual CPUs (hyperthreads, logical cores).

### CPUs

By default, each task is allocated its own _physical_ CPU core and can use both of its _virtual_ CPUs.
Different jobs cannot share a physical core, even if each uses only one virtual CPU.

The following options can be used to adjust the resource allocation:

| Option                        | Description |
|-------------------------------|-------------|
| `--cpus-per-task=<number>`    | Allocate `<number>` virtual CPUs per task/process. Cannot be used with `--cpus-per-gpu`. |
| `--cpus-per-gpu=<number>`     | Allocate `<number>` virtual CPUs for each allocated GPU (see below). Cannot be used with `--cpus-per-task`. |
| `--threads-per-core=<number>` | Restrict node selection to nodes with at least the specified number of hyperthreads per core. In the task layout, use the specified maximum number of hyperthreads per core. |
| `--hint=nomultithread`        | Do not consider hyperthreads in the task layout (effectively the same as `--threads-per-core=1`). |
| `--hint=multithread`          | Consider hyperthreads in the task layout (effectively the same as `--threads-per-core=2`). |
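
Putting the options together: since hyper-threading gives 2 virtual CPUs per physical core, a job with 4 processes, each using 4 physical cores (8 virtual CPUs), could request (the numbers are illustrative):

```bash
#SBATCH --ntasks=4              # 4 tasks/processes
#SBATCH --cpus-per-task=8       # 8 virtual CPUs = 4 physical cores per process
```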

### GPUs

By default, jobs do not have access to the GPUs on compute nodes; users must explicitly request a specific number of GPUs for their jobs.
The following options are available:

| Option                        | Description |
|-------------------------------|-------------|
| `--gpus=<number>`             | Total number of GPUs required for the job. |
| `--gpus-per-task=<number>`    | Number of GPUs required for each task/process of the job. |
| `--gpus-per-socket=<number>`  | Number of GPUs required for the job on each socket. |
| `--gpus-per-node=<number>`    | Number of GPUs required for the job on each node. |

When running MPI jobs (explained later), all processes running on the same node have access to all GPUs allocated to the job on that node.
The following option can be used to bind each process to its own GPU:

```bash
# bind each process to its own GPU (single:<tasks_per_gpu>)
#SBATCH --gpu-bind=single:1
```

Binding is done using the `CUDA_VISIBLE_DEVICES` environment variable.
For example, executing a job with `--ntasks-per-node=2`, `--gpus-per-task=1` and `--gpu-bind=single:1` will let each process see either `CUDA_VISIBLE_DEVICES=0` or `CUDA_VISIBLE_DEVICES=1` instead of `CUDA_VISIBLE_DEVICES=0,1`.
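
Inside the job script, each task can check which GPU it was bound to by inspecting the variable. A sketch (here we simulate the value Slurm would assign to the first task, since no real GPU binding happens outside a job):

```shell
# Slurm sets CUDA_VISIBLE_DEVICES per task when --gpu-bind is used;
# we simulate the value assigned to the first task
export CUDA_VISIBLE_DEVICES=0

# the process (and CUDA) will only see the listed device(s)
echo "visible GPUs: ${CUDA_VISIBLE_DEVICES}"
```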

### RAM

System memory can be allocated as a consumable resource using the `--mem` option:

```bash
# how much RAM per node can be allocated for the job (default: 2000M, max: 15000M)
#SBATCH --mem=10G
```

__Note:__ The `--mem-per-cpu` and `--mem-per-gpu` options are not applicable due to the cluster configuration (`DefMemPerNode` in `slurm.conf`).

## Examples

### OpenMP jobs

A typical OpenMP job allocates `N` CPU cores for a single task and runs `N` OpenMP threads.
This can be achieved by exporting the `OMP_NUM_THREADS` environment variable according to the Slurm configuration:

```bash
#!/bin/bash

# job name (default is the name of this file)
#SBATCH --job-name=example-openmp
# file name for stdout/stderr (%x will be replaced with the job name, %j with the jobid)
#SBATCH --output=log.%x.job_%j
# maximum wall time allocated for the job (D-H:MM:SS)
#SBATCH --time=0:01:00

#SBATCH --partition=gpXY        # partition/queue name for the job submission
#SBATCH --ntasks=1              # number of tasks/processes

#SBATCH --threads-per-core=1    # do not use hyperthreads (i.e. CPUs = physical cores below)
#SBATCH --cpus-per-task=8       # number of CPUs per process

# how much RAM per node can be allocated for the job (default: 2000M, max: 15000M)
#SBATCH --mem=10G

# start the job in the directory it was submitted from
cd "$SLURM_SUBMIT_DIR"

# set the number of threads for OpenMP
export OMP_NUM_THREADS="$SLURM_CPUS_PER_TASK"

# run the computation
./my_solver
```
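
When testing the same script outside Slurm, `SLURM_CPUS_PER_TASK` is unset and `OMP_NUM_THREADS` would end up empty. A defensive variant with a fallback default:

```shell
# simulate running outside Slurm, where the variable is unset
unset SLURM_CPUS_PER_TASK

# fall back to 1 thread when not running under Slurm
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
echo "$OMP_NUM_THREADS"
```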

### MPI jobs

For MPI jobs, set `--ntasks` to a value greater than 1.
Alternatively, for more control over the task layout, set `--nodes` and `--ntasks-per-node` instead.

Use `srun` or `mpirun` in the script to launch a computation across multiple CPUs or nodes.
For example:

```bash
#!/bin/bash

# job name (default is the name of this file)
#SBATCH --job-name=example-mpi
# file name for stdout/stderr (%x will be replaced with the job name, %j with the jobid)
#SBATCH --output=log.%x.job_%j
# maximum wall time allocated for the job (D-H:MM:SS)
#SBATCH --time=0:01:00

#SBATCH --partition=gpXY        # partition/queue name for the job submission

#SBATCH --nodes=2               # number of nodes
#SBATCH --ntasks-per-node=2     # MPI processes per node

#SBATCH --threads-per-core=1    # do not use hyperthreads (i.e. CPUs = physical cores below)
#SBATCH --cpus-per-task=2       # number of CPUs per process
#SBATCH --gpus-per-task=1       # number of GPUs per process
#SBATCH --gpu-bind=single:1     # bind each process to its own GPU (single:<tasks_per_gpu>)

# how much RAM per node can be allocated for the job (default: 2000M, max: 15000M)
#SBATCH --mem=10G

# start the job in the directory it was submitted from
cd "$SLURM_SUBMIT_DIR"

# run the computation
srun ./my_solver
```

## Job arrays

TODO: https://slurm.schedmd.com/job_array.html
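
A minimal sketch in the meantime: a job array runs many independent copies of the same script, each receiving its index in `SLURM_ARRAY_TASK_ID` (also available as `%a` in `--output`). The input file naming below is hypothetical:

```bash
#!/bin/bash
#SBATCH --job-name=example-array
#SBATCH --output=log.%x.job_%j.task_%a   # %a is the array task index
#SBATCH --time=0:01:00
#SBATCH --partition=gpXY
#SBATCH --ntasks=1
#SBATCH --array=0-9                      # run 10 independent tasks, indices 0..9

cd "$SLURM_SUBMIT_DIR"

# each array task processes its own input, selected by the index
./my_solver "input_${SLURM_ARRAY_TASK_ID}.dat"
```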

## Interactive jobs

TODO: `srun --pty`
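
As a sketch, an interactive shell on a compute node (here with one task, one GPU, and a 1-hour limit; adjust the resources to your needs) can be requested with:

```bash
srun --partition=gpXY --ntasks=1 --gpus=1 --time=1:00:00 --pty bash -i
```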

## Managing jobs

TODO:
- monitoring jobs - https://hpc.nih.gov/docs/userguide.html#monitor
- deleting jobs - https://hpc.nih.gov/docs/userguide.html#delete
- job states - https://hpc.nih.gov/docs/userguide.html#states
- modifying jobs after submission - https://hpc.nih.gov/docs/userguide.html#modify_job