# A slurm GPU test program

Python code by Franco Rugolon, slurm wrappers by Erik Thuning

This is a small set of scripts intended both to validate that a GPU-enabled slurm cluster is functioning and to serve as minimal instructions for getting a project up and running with slurm.

## Getting started

All of the following steps are to be executed on a slurm cluster. The scripts should be runnable unmodified on DSV's Olympus cluster. The `--partition` flag used in these scripts will probably need to be changed when running on another cluster.

1. Clone the repository

   `$ git clone https://gitea.dsv.su.se/erth9960/slurm-gpu-test.git`

1. Move into the cloned repository

   `$ cd slurm-gpu-test`

1. Set up your python virtual environment

   `$ sbatch slurm-setup.sh`

1. Run the test program

   `$ sbatch slurm-run.sh`

If all of the steps execute successfully, the cluster can properly access GPUs.

## Scripts

### validate.py

A very small python script that simply prints information on GPU availability.

### synth-data-gen.py

The program used to test GPU functionality. It uses torch to generate some synthetic data and needs a GPU in order to do so efficiently.

### setup-env.sh

Sets up the python virtual environment that the program needs in order to function. The environment is created with `--system-site-packages` to avoid having to install pytorch specifically for this project, since pytorch is a very large library.

### run-job.sh

Activates the virtual environment and runs the python programs.

### slurm-setup.sh

Calls `setup-env.sh` as a slurm job. Running the setup as a slurm job ensures that the virtual environment is created under the same circumstances in which it is going to be used, which minimises the risk of strange errors due to version mismatches and the like. Note the `--account` flag: it ensures that your job runs with higher limits than the fairly restrictive defaults. Using it here _probably_ isn't necessary, since library installation should be fairly quick.
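The wrapper scripts themselves are not reproduced in this README, but a minimal sbatch wrapper in the style described above might look like the following sketch. The partition name, account name, and resource values here are assumptions for illustration, not the repository's actual settings; adjust them for your cluster.

```shell
#!/bin/bash
#SBATCH --partition=gpu         # assumed partition name; change for your cluster
#SBATCH --account=myproject     # assumed account; raises the restrictive default limits
#SBATCH --gres=gpu:1            # request one GPU for the job
#SBATCH --time=00:10:00         # wall-clock limit for the job

# Activate the project's virtual environment, then run the test program.
source venv/bin/activate
python synth-data-gen.py
```

A wrapper like this is submitted with `sbatch`, and slurm schedules it on a node that satisfies the `#SBATCH` resource requests.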
### slurm-run.sh

Calls `run-job.sh` as a slurm job. This is how `synth-data-gen.py` actually gets run on the GPU computing nodes. Note the `--account` flag: it ensures that your job runs with higher limits than the fairly restrictive defaults. It's important to set the flag correctly when running your own code, so that it doesn't suddenly get interrupted due to resource limits.

## Useful slurm commands

In addition to `sbatch`, which appears in the scripts, these commands are useful for inspecting the state of the slurm cluster or of specific jobs:

* `sinfo` Shows a short status summary of the cluster as a whole.
* `squeue` Shows any currently executing jobs and jobs queued for execution.
* `scontrol show job NN` Shows detailed information on the running or recently terminated job with ID `NN`.
* `sacct` Can be queried for information on all jobs ever run. Not trivial to call properly.

All of these commands have documentation available via `man` (e.g. `man sinfo`). The same documentation is also available online [here](https://slurm.schedmd.com/man_index.html).
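Since `sacct` is the least obvious of these to call, here is a hedged example of a typical invocation. The job ID is hypothetical, and the `--format` field list is just one reasonable selection; see `man sacct` for the full set of available fields.

```shell
# Query accounting data for a single (hypothetical) job ID.
# --format selects which columns to print; without it the default
# output is often too wide or too sparse to be useful.
sacct --jobs 12345 \
      --format=JobID,JobName,Partition,State,Elapsed,MaxRSS
```

Dropping the `--jobs` flag shows your recent jobs instead of one specific job, which is a quick way to check whether a job completed, failed, or was killed for exceeding its limits.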