# A slurm GPU test program

Python code by Franco Rugolon, slurm wrappers by Erik Thuning.
This is a small set of scripts intended both to validate that a GPU-enabled slurm cluster is working and to serve as minimal instructions for getting up and running with a project under slurm.
## Getting started
All of the following steps are to be executed on a slurm cluster. The scripts should run unmodified on DSV's Olympus cluster; the `--partition` flag used in these scripts will probably need to be changed when running on another cluster.
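Partitions and other resources can be requested either on the `sbatch` command line or via `#SBATCH` directives at the top of a job script. The header below is only a sketch with placeholder values, not the actual contents of the scripts in this repository:

```shell
#!/bin/bash
# Hypothetical job-script header -- the partition name and GPU count
# are placeholders; check `sinfo` for the partitions your cluster offers.
#SBATCH --partition=gpu       # which partition to queue the job on
#SBATCH --gres=gpu:1          # request one GPU for the job
#SBATCH --time=00:10:00       # wall-clock time limit
```

Command-line flags to `sbatch` override the corresponding `#SBATCH` directives, so something like `sbatch --partition=other slurm-run.sh` works without editing the script.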
- Clone the repository

  `$ git clone https://gitea.dsv.su.se/erth9960/slurm-gpu-test.git`

- Move into the cloned repository

  `$ cd slurm-gpu-test`

- Set up your python virtual environment

  `$ sbatch slurm-setup.sh`

- Run the test program

  `$ sbatch slurm-run.sh`
If all of the steps execute successfully, the cluster can properly access GPUs.
## Scripts
### validate.py
A very small python script that simply prints information on GPU availability.
### synth-data-gen.py
The program used to test GPU functionality. It uses torch to generate some synthetic data and needs a GPU in order to do so efficiently.
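As a rough sketch of what such a check-and-generate step can look like in torch (the actual contents of validate.py and synth-data-gen.py are in the repository and may differ):

```shell
# Sketch only: report GPU availability, then generate a small batch of
# synthetic data on the GPU if one is visible. Exits cleanly when torch
# is not installed, since this is only meant to run on the cluster.
result=$(python3 - <<'EOF'
try:
    import torch
except ImportError:
    print("torch not installed")
    raise SystemExit(0)
device = "cuda" if torch.cuda.is_available() else "cpu"
data = torch.randn(256, 256, device=device)  # synthetic data
print("generated", tuple(data.shape), "on", device)
EOF
)
echo "$result"
```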
### setup-env.sh

Sets up the python virtual environment that the program needs in order to function. The environment is created with `--system-site-packages` so that the globally installed pytorch can be reused instead of being installed specifically for this project, because pytorch is a very large library.
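A minimal sketch of that kind of setup (the actual setup-env.sh is in the repository); the flag is passed when the virtual environment is created, and the directory name `venv` here is a placeholder:

```shell
# Create a venv that can also see globally installed packages such as
# pytorch, then activate it.
python3 -m venv --system-site-packages venv
. venv/bin/activate
python -c 'import sys; print(sys.prefix)'   # now points into the venv
```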
### run-job.sh
Activates the virtual environment and runs the python programs.
### slurm-setup.sh

Runs setup-env.sh as a slurm job. Doing this as a slurm job ensures that the virtual environment is created under the same circumstances it is going to be used in, which minimises the risk of strange errors due to version mismatches and the like.
Note the `--account` flag; it ensures that your job runs with higher limits than the fairly restrictive defaults. Using it here probably isn't necessary, since library installation should be fairly quick.
### slurm-run.sh

Runs run-job.sh as a slurm job. This is how synth-data-gen.py actually gets run on the GPU compute nodes.
Note the `--account` flag; it ensures that your job runs with higher limits than the fairly restrictive defaults. It's important to set this flag correctly when running your own code, so that it isn't suddenly interrupted by resource limits.
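Put together, a wrapper like slurm-run.sh typically amounts to an `#SBATCH` header plus a call to the script that does the actual work. The sketch below uses placeholder values (partition, account, GPU count) and is not the repository's actual script:

```shell
#!/bin/bash
# Hypothetical wrapper -- all resource values below are placeholders.
#SBATCH --partition=gpu        # a GPU-enabled partition on the cluster
#SBATCH --gres=gpu:1           # one GPU
#SBATCH --account=myproject    # project account with raised limits
#SBATCH --time=00:30:00        # wall-clock time limit

# Hand over to the script that activates the virtual environment and
# runs the python programs (run-job.sh in this repository).
bash run-job.sh
```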
## Useful slurm commands

In addition to sbatch, which appears in the scripts above, these are some useful commands for inspecting the state of the slurm cluster or of specific jobs:
- `sinfo`

  Shows a short status summary of the cluster as a whole.

- `squeue`

  Shows any currently executing jobs and jobs queued for execution.

- `scontrol show job NN`

  Shows detailed information on the running or recently terminated job with ID NN.

- `sacct`

  Can be queried for information on all jobs ever run. Not trivial to call properly.
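For example, a sacct query for a single job might look like the following; the job ID is a placeholder, and the `--format` fields shown are a small selection of the standard sacct fields:

```shell
# Guarded so the example is harmless off-cluster; on the cluster the
# `command -v` check succeeds and the real query runs. Job ID 12345
# is a placeholder.
if command -v sacct >/dev/null 2>&1; then
    sacct -j 12345 --format=JobID,JobName,State,Elapsed,MaxRSS
else
    echo "sacct not found (run this on the cluster)"
fi
```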
All of these commands have documentation available by running `man <program>` (e.g. `man sinfo`). The same documentation is also available online.