# A slurm GPU test program

Python code by Franco Rugolon, slurm wrappers by Erik Thuning

This is a small set of scripts intended both to validate that a GPU-enabled slurm cluster is functioning and to serve as minimal instructions for getting a project up and running with slurm.

## Getting started

All of the following steps are to be executed on a slurm cluster. The scripts should be runnable unmodified on DSV's Olympus cluster. The `--partition` flag used in these scripts will probably need to be changed when running on another cluster.

1. Clone the repository

   `$ git clone https://gitea.dsv.su.se/erth9960/slurm-gpu-test.git`

1. Move into the cloned repository

   `$ cd slurm-gpu-test`

1. Set up your python virtual environment

   `$ sbatch slurm-setup.sh`

1. Run the test program

   `$ sbatch slurm-run.sh`

If all of the steps execute successfully, the cluster can properly access GPUs.

## Scripts

### validate.py

A very small python script that simply prints information on GPU availability.

### synth-data-gen.py

The program used to test GPU functionality. It uses torch to generate some synthetic data and needs a GPU in order to do so efficiently.

### setup-env.sh

Sets up the python virtual environment that the program needs in order to function. The environment is created with `--system-site-packages` to avoid having to install pytorch specifically for this project, since pytorch is a very large library.

### run-job.sh

Activates the virtual environment and runs the python programs.

### slurm-setup.sh

Calls `setup-env.sh` as a slurm job. Running the setup as a slurm job ensures that the virtual environment is created under the same circumstances in which it is going to be used, which minimises the risk of strange errors due to version mismatches and the like. Note the `--account` flag: it ensures that your job runs with higher limits than the fairly restrictive defaults. Using it here _probably_ isn't necessary, since library installation should be fairly quick.
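The wrapper scripts themselves are not reproduced in this README, but a minimal sbatch wrapper in the style described above might look like the following sketch. The partition name, account name, and resource values here are assumptions for illustration, not the repository's actual settings; adjust them for your cluster.

```shell
#!/bin/bash
#SBATCH --partition=gpu         # assumed partition name; change for your cluster
#SBATCH --account=myproject     # assumed account; raises the restrictive default limits
#SBATCH --gres=gpu:1            # request one GPU for the job
#SBATCH --time=00:10:00         # wall-clock limit for the job

# Activate the project's virtual environment, then run the test program.
source venv/bin/activate
python synth-data-gen.py
```

A wrapper like this is submitted with `sbatch`, and slurm schedules it on a node that satisfies the `#SBATCH` resource requests.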
### slurm-run.sh

Calls `run-job.sh` as a slurm job. This is how `synth-data-gen.py` actually gets run on the GPU computing nodes. Note the `--account` flag: it ensures that your job runs with higher limits than the fairly restrictive defaults. It's important to set the flag correctly when running your own code, so that it doesn't suddenly get interrupted due to resource limits.

## Useful slurm commands

In addition to `sbatch`, which appears in the scripts, these commands are useful for inspecting the state of the slurm cluster or of specific jobs:

* `sinfo` Shows a short status summary of the cluster as a whole.
* `squeue` Shows any currently executing jobs and jobs queued for execution.
* `scontrol show job NN` Shows detailed information on the running or recently terminated job with ID `NN`.
* `sacct` Can be queried for information on all jobs ever run. Not trivial to call properly.

All of these commands have documentation available via `man` (e.g. `man sinfo`). The same documentation is also available online [here](https://slurm.schedmd.com/man_index.html).
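Since `sacct` is the least obvious of these to call, here is a hedged example of a typical invocation. The job ID is hypothetical, and the `--format` field list is just one reasonable selection; see `man sacct` for the full set of available fields.

```shell
# Query accounting data for a single (hypothetical) job ID.
# --format selects which columns to print; without it the default
# output is often too wide or too sparse to be useful.
sacct --jobs 12345 \
      --format=JobID,JobName,Partition,State,Elapsed,MaxRSS
```

Dropping the `--jobs` flag shows your recent jobs instead of one specific job, which is a quick way to check whether a job completed, failed, or was killed for exceeding its limits.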