# A slurm GPU test program

Python code by Franco Rugolon, slurm wrappers by Erik Thuning

This is a small set of scripts intended both to validate that a
GPU-enabled slurm cluster works and to serve as minimal instructions on
how to get up and running with a project in slurm.

## Getting started

All the following steps are to be executed on a slurm cluster.

The scripts should be runnable unmodified on DSV's Olympus
cluster. The `--partition` flag used in these scripts will probably
need to be changed when running on another cluster.

1. Clone the repository
   `$ git clone https://gitea.dsv.su.se/erth9960/slurm-gpu-test.git`

1. Move into the cloned repository
   `$ cd slurm-gpu-test`

1. Set up your python virtual environment
   `$ sbatch slurm-setup.sh`
   `sbatch` returns as soon as the job is queued, so wait for the
   setup job to finish (check with `squeue`) before continuing.

1. Run the test program
   `$ sbatch slurm-run.sh`

If all of the steps execute successfully, the cluster can properly
access GPUs.

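One way to check that a step succeeded is to read the job's output file. The sketch below assumes that `sbatch` prints its usual `Submitted batch job <jobid>` line and that the scripts do not override slurm's default output file name (`slurm-<jobid>.out` in the directory `sbatch` was run from). The helper names are made up for this example and are not part of the repository.

```python
import re


def parse_job_id(sbatch_output: str) -> int:
    """Extract the job ID from sbatch's 'Submitted batch job NN' line."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError(f"no job ID found in: {sbatch_output!r}")
    return int(match.group(1))


def default_output_file(job_id: int) -> str:
    """File name slurm uses for job output when --output is not set."""
    return f"slurm-{job_id}.out"


if __name__ == "__main__":
    # 4242 is a placeholder job ID, not real cluster output.
    job_id = parse_job_id("Submitted batch job 4242")
    print(default_output_file(job_id))
```

The same job ID can be passed to `scontrol show job NN`, described further down.
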
## Scripts
### validate.py
A very small python script that simply prints information on
GPU availability.

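As a rough illustration of what such a check can look like (a minimal sketch, not the actual contents of `validate.py`), the code below uses PyTorch's `torch.cuda` functions and reports a missing PyTorch installation instead of crashing. The function name `gpu_report` is made up for this example.

```python
def gpu_report() -> list[str]:
    """Return human-readable lines describing GPU availability."""
    try:
        import torch
    except ImportError:
        return ["torch is not installed, cannot check for GPUs"]

    if not torch.cuda.is_available():
        return ["torch is installed, but no GPU is available"]

    # One summary line, then one line per visible GPU.
    lines = [f"GPUs available: {torch.cuda.device_count()}"]
    for index in range(torch.cuda.device_count()):
        lines.append(f"  GPU {index}: {torch.cuda.get_device_name(index)}")
    return lines


if __name__ == "__main__":
    print("\n".join(gpu_report()))
```

Run on a login node without a GPU, a check like this reports that no GPU is available, which is why the repository's scripts run it through slurm on the GPU nodes.
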
### synth-data-gen.py
The program used to test GPU functionality. It uses torch to generate
some synthetic data and needs a GPU in order to do so efficiently.

### setup-env.sh
Sets up the python virtual environment that the program needs in order
to function. The environment is created with `--system-site-packages`
so that it can use the pytorch already installed on the system instead
of installing pytorch specifically for this project, because pytorch
is a very large library.

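The effect of `--system-site-packages` can be seen without slurm. The sketch below uses Python's standard `venv` module; whether `setup-env.sh` uses `python3 -m venv` or another tool is an assumption here. It creates a throwaway environment in a temporary directory and reads back the setting that `venv` records in `pyvenv.cfg`. The function name is made up for this example.

```python
import tempfile
import venv
from pathlib import Path


def system_site_packages_setting(env_dir: Path) -> str:
    """Create a venv in env_dir and return its site-packages setting."""
    # Roughly: python3 -m venv --system-site-packages --without-pip env_dir
    # pip is left out only to keep this demonstration fast.
    venv.create(env_dir, system_site_packages=True, with_pip=False)
    for line in (env_dir / "pyvenv.cfg").read_text().splitlines():
        key, _, value = line.partition("=")
        if key.strip() == "include-system-site-packages":
            return value.strip()
    raise LookupError("setting not found in pyvenv.cfg")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(system_site_packages_setting(Path(tmp) / "venv"))
```

With the setting enabled, packages installed system-wide stay importable inside the environment, while anything installed with pip inside the environment takes precedence.
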
### run-job.sh
Activates the virtual environment and runs the python programs.

### slurm-setup.sh
Calls `setup-env.sh` as a slurm job. This is run as a slurm job to
ensure that the virtual environment is created under the same
circumstances that it is going to be used in. This minimises the risk
of strange errors due to version mismatches etc.
Note the `--account` flag: it ensures that your job is run with higher
limits than the fairly restrictive defaults. Using it here _probably_
isn't necessary, since library installation should be fairly quick.

### slurm-run.sh
Calls `run-job.sh` as a slurm job. This is how `synth-data-gen.py`
actually gets run on the GPU computing nodes.
Note the `--account` flag: it ensures that your job is run with higher
limits than the fairly restrictive defaults. It's important to set the
flag correctly when running your own code so that it doesn't suddenly
get interrupted due to resource limits.

## Useful slurm commands

In addition to `sbatch`, which is used to submit the scripts, these are
some useful slurm commands for inspecting the state of the slurm
cluster or of specific jobs:

* `sinfo`
  Shows a short status summary of the cluster as a whole.

* `squeue`
  Shows any currently executing jobs and jobs queued for execution.

* `scontrol show job NN`
  Shows detailed information on the running or recently terminated
  job with ID `NN`.

* `sacct`
  Can be queried for information on past jobs from the accounting
  records. Not trivial to call properly: by default it only lists
  jobs from the current day, so older jobs need options such as
  `--starttime`.

All these commands have documentation available by running
`man <program>` (e.g. `man sinfo`). The same documentation is also
available online in the
[slurm man page index](https://slurm.schedmd.com/man_index.html).

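Since `sacct` takes some care to call, here is a hedged sketch of one reasonable invocation, wrapped in Python to match the project's other code. The options used (`--jobs`, `--starttime`, `--format`, `--parsable2`) and the field names are standard `sacct` ones, but the choice of fields and the helper name `sacct_command` are only this example's choices.

```python
import shutil
import subprocess

# Accounting fields to print, in order; adjust to taste.
FIELDS = ["JobID", "JobName", "Partition", "State", "Elapsed", "ExitCode"]


def sacct_command(job_id=None, since=None):
    """Build an sacct argument list for one job or for jobs since a date."""
    command = ["sacct", "--parsable2", "--format=" + ",".join(FIELDS)]
    if job_id is not None:
        command.append(f"--jobs={job_id}")
    if since is not None:
        # Date such as "2024-01-31". With neither --starttime nor --jobs,
        # sacct only goes back to midnight of the current day.
        command.append(f"--starttime={since}")
    return command


if __name__ == "__main__":
    command = sacct_command(since="2024-01-01")
    print(" ".join(command))
    # Only call sacct where it is actually installed, i.e. on the cluster.
    if shutil.which("sacct"):
        subprocess.run(command, check=False)
```

`--parsable2` separates the fields with `|`, which makes the output easy to split in a script. Leave it out for aligned columns when reading the output by eye.
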