# A slurm GPU test program

Python code by Franco Rugolon, slurm wrappers by Erik Thuning.
This is a small set of scripts intended both to validate that a GPU-enabled slurm cluster is working and to serve as minimal instructions for getting up and running with a project under slurm.
## Getting started
All of the following steps are to be executed on a slurm cluster. The scripts should run unmodified on DSV's Olympus cluster; the `--partition` flag used in these scripts will probably need to be changed when running on another cluster.
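Partitions and other resources can be requested either on the `sbatch` command line or via `#SBATCH` directives at the top of a job script. The header below is only a sketch with placeholder values, not the actual contents of the scripts in this repository:

```shell
#!/bin/bash
# Hypothetical job-script header -- the partition name and GPU count
# are placeholders; check `sinfo` for the partitions your cluster offers.
#SBATCH --partition=gpu       # which partition to queue the job on
#SBATCH --gres=gpu:1          # request one GPU for the job
#SBATCH --time=00:10:00       # wall-clock time limit
```

Command-line flags to `sbatch` override the corresponding `#SBATCH` directives, so something like `sbatch --partition=other slurm-run.sh` works without editing the script.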
- Clone the repository

  `$ git clone https://gitea.dsv.su.se/erth9960/slurm-gpu-test.git`

- Move into the cloned repository

  `$ cd slurm-gpu-test`

- Set up your python virtual environment

  `$ sbatch slurm-setup.sh`

- Run the test program

  `$ sbatch slurm-run.sh`
If all of the steps execute successfully, the cluster can properly access GPUs.
## Scripts
### validate.py
A very small python script that simply prints information on GPU availability.
### synth-data-gen.py
The program used to test GPU functionality. It uses torch to generate some synthetic data and needs a GPU in order to do so efficiently.
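As a rough sketch of what such a check-and-generate step can look like in torch (the actual contents of validate.py and synth-data-gen.py are in the repository and may differ):

```shell
# Sketch only: report GPU availability, then generate a small batch of
# synthetic data on the GPU if one is visible. Exits cleanly when torch
# is not installed, since this is only meant to run on the cluster.
result=$(python3 - <<'EOF'
try:
    import torch
except ImportError:
    print("torch not installed")
    raise SystemExit(0)
device = "cuda" if torch.cuda.is_available() else "cpu"
data = torch.randn(256, 256, device=device)  # synthetic data
print("generated", tuple(data.shape), "on", device)
EOF
)
echo "$result"
```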
### setup-env.sh

Sets up the python virtual environment that the program needs in order to function. The environment is created with `--system-site-packages` so that the globally installed pytorch can be reused instead of being installed specifically for this project, because pytorch is a very large library.
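A minimal sketch of that kind of setup (the actual setup-env.sh is in the repository); the flag is passed when the virtual environment is created, and the directory name `venv` here is a placeholder:

```shell
# Create a venv that can also see globally installed packages such as
# pytorch, then activate it.
python3 -m venv --system-site-packages venv
. venv/bin/activate
python -c 'import sys; print(sys.prefix)'   # now points into the venv
```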
### run-job.sh
Activates the virtual environment and runs the python programs.
### slurm-setup.sh

Runs setup-env.sh as a slurm job. Doing this as a slurm job ensures that the virtual environment is created under the same circumstances it is going to be used in, which minimises the risk of strange errors due to version mismatches and the like.
Note the `--account` flag; it ensures that your job runs with higher limits than the fairly restrictive defaults. Using it here probably isn't necessary, since library installation should be fairly quick.
### slurm-run.sh

Runs run-job.sh as a slurm job. This is how synth-data-gen.py actually gets run on the GPU compute nodes.
Note the `--account` flag; it ensures that your job runs with higher limits than the fairly restrictive defaults. It's important to set this flag correctly when running your own code, so that it isn't suddenly interrupted by resource limits.
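Put together, a wrapper like slurm-run.sh typically amounts to an `#SBATCH` header plus a call to the script that does the actual work. The sketch below uses placeholder values (partition, account, GPU count) and is not the repository's actual script:

```shell
#!/bin/bash
# Hypothetical wrapper -- all resource values below are placeholders.
#SBATCH --partition=gpu        # a GPU-enabled partition on the cluster
#SBATCH --gres=gpu:1           # one GPU
#SBATCH --account=myproject    # project account with raised limits
#SBATCH --time=00:30:00        # wall-clock time limit

# Hand over to the script that activates the virtual environment and
# runs the python programs (run-job.sh in this repository).
bash run-job.sh
```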
## Useful slurm commands

In addition to sbatch, which appears in the scripts above, these are some useful commands for inspecting the state of the slurm cluster or of specific jobs:
- `sinfo`

  Shows a short status summary of the cluster as a whole.

- `squeue`

  Shows any currently executing jobs and jobs queued for execution.

- `scontrol show job NN`

  Shows detailed information on the running or recently terminated job with ID NN.

- `sacct`

  Can be queried for information on all jobs ever run. Not trivial to call properly.
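For example, a sacct query for a single job might look like the following; the job ID is a placeholder, and the `--format` fields shown are a small selection of the standard sacct fields:

```shell
# Guarded so the example is harmless off-cluster; on the cluster the
# `command -v` check succeeds and the real query runs. Job ID 12345
# is a placeholder.
if command -v sacct >/dev/null 2>&1; then
    sacct -j 12345 --format=JobID,JobName,State,Elapsed,MaxRSS
else
    echo "sacct not found (run this on the cluster)"
fi
```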
All of these commands have documentation available by running `man <program>` (e.g. `man sinfo`). The same documentation is also available online.