Help:Slurm
The Slurm Workload Manager (formerly known as the Simple Linux Utility for Resource Management, or SLURM) is a free and open-source job scheduler for Linux.
- Software homepage
- https://slurm.schedmd.com
- Software availability
- newer clusters
- Other related software
- replaces sge
- command to type to run
- sinfo sbatch srun scancel squeue
- View online documentation
- https://slurm.schedmd.com/quickstart.html
- https://slurm.schedmd.com/tutorials.html
- https://arcc.ist.ucf.edu/index.php/help/tutorials/job-submission-on-stokes-with-slurm
- https://srcc.stanford.edu/sge-slurm-conversion
- SLURM Command Option Summary (cheat sheet)
- https://slurm.schedmd.com/pdfs/summary.pdf
- Location of example files
Examples[edit]
batch example[edit]
Sample Job script:
#!/bin/bash
sleep 100
Make the script executable:
chmod +x testscript
Submit the script
sbatch testscript
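A slightly fuller sketch of a job script (the resource values here are illustrative) carries its requests as #SBATCH comments. Since those lines are ordinary comments to bash, the script also runs standalone:

```shell
#!/bin/bash
#SBATCH -c 1            # one cpu (illustrative)
#SBATCH --mem=1G        # 1G of memory (illustrative)
echo "starting on $(hostname)"
sleep 2
echo "finished"
```

Submit it with sbatch as above; by default the output is saved to slurm-<jobid>.out.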
example slurm script to use gpu[edit]
save the following as (for example) testjob.slurm
#!/bin/bash
#SBATCH -p gpu
#SBATCH --gres=gpu:1
#SBATCH -c 4
echo CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES
python3 tens.py
then make it executable:
$ chmod +x testjob.slurm
Submit this job like this:
$ module load cuda cudnn
$ sbatch testjob.slurm
Note: you can also specify gpu type, for example: (pick one)
#SBATCH --gres=gpu:pascal:1
#SBATCH --gres=gpu:kepler:1
Note the use of CUDA_VISIBLE_DEVICES: printing its value may help with debugging, and lets you match the gpus your job used against the system performance graphs.
check slurm status and resource availability[edit]
- sinfo
- get basic node status and partition list
- sinfo -O nodelist,partition,gres,features
- get list of available generic resources
Example:
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal*      up   infinite      4   idle c0-[1-4]
gpu          up   infinite      1   idle c0-0
$ sinfo -O nodelist,partition,nodes,gres,features
NODELIST            GRES                AVAIL_FEATURES
c0-1                (null)              zswap
c0-0                gpu:kepler:2        gpu
$ sinfo -O partition,nodelist,cpusstate,gres
In this example, there are two kepler gpus available (request with --gres=gpu:1) in the gpu partition (-p gpu)
As a simple example:
sbatch -p gpu --gres=gpu:1 --wrap="nvidia-smi"
or
srun -p gpu --gres=gpu:1 nvidia-smi
To check on running jobs:
squeue
To cancel a job,
scancel jobid
where jobid can be found in the output from squeue
To view information about completed jobs:
sacct
To view information about old jobs (for instance, jobs since Jan 2):
sacct --start=0102
To view all information about a specific job by number: (replace ### with job number)
sacct -j ### -o all -P | strans | less
The strans command (coupled with sacct's -P option) transposes the table to make it more readable. You can also list the specific fields you want to see with sacct -o field,list, or try spformat, which separates out common fields and resizes column widths. (Also try spformat -fK to split the output into multiple small tables.)
For detailed information about a job that is either currently running or waiting to run, supply its jobid:
scontrol show job 123
interactive example[edit]
Note: interactive jobs are discouraged. Resources requested by interactive sessions are unavailable to other users until the session closes. Abuse of this will cause limits to be placed on interactive sessions.
srun --pty -I
or if you want to run bash interactively on a node:
srun --pty -I bash
Slurm options[edit]
These can be put on the command line of sbatch or srun or added in your batch script prefixed with #SBATCH
These are a few of the interesting options; for a complete list, check the sbatch man page.
Please note: slurm options are somewhat sensitive to order!
- The partition option should be first ( -p gpu )
- QOS options should be next
- GPU requests should be after QOS
- In srun, the command to run should be last with all slurm options before it.
Options that select resource allocation permissions (such as partition, qos) need to be early in the options.
common options[edit]
- -c #
- allocate # cpus per task (default=1)
- -C features
- request a node with special features; run sinfo -O features for a list of available features.
- --exclusive
- request an exclusive node rather than sharing the node with other jobs
- --mem=#
- memory needed per node
- --mem-per-cpu=#
- provide a minimum amount of memory per cpu (most clusters default to 8G)
- --cpus-per-task=#
- long form of -c; request # cpus for the job task (Note: number can be a range)
- --nodes=#
- allocate at least # nodes for the job (but see below)
- -p partition
- run in a specific partition instead of the default partition (list partitions with sinfo; some clusters have a gpu partition)
- --wrap="command"
- wrap a command in a shell script instead of specifying the script
- -C 'feature|feature...'
- request features available on some nodes. see sinfo -O features for a cluster specific list of features available.
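Putting several of these together (the partition name, feature name, and program below are illustrative), a job script header might look like:

```shell
#!/bin/bash
#SBATCH -p normal          # partition first (list real names with sinfo)
#SBATCH -c 4               # four cpus for the task
#SBATCH --mem=16G          # 16G of memory on the node
#SBATCH -C zswap           # request a node feature (see sinfo -O features)
./myprogram                # hypothetical program
```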
gpu use[edit]
Use these slurm options to request gpus. As with other options, they can either be put on the sbatch command line or in your script.
- --gres=gpu:1
- request one gpu
- --gres=gpu:pascal:1
- request one pascal gpu (see sinfo -O gres)
- -p gpu
- select the gpu partition (needed on clusters with both gpu nodes and cpu only nodes, check with sinfo )
Deprecated slurm options[edit]
We do not recommend the following options because the defaults work well:
- --cpus-per-gpu
- DO NOT USE THIS OPTION. IT DOESN'T WORK IN THE CURRENT VERSION OF SLURM!
- --output=filename_pattern
- The default is to use 'slurm-%j.out' or 'slurm-%A_%a.out' which includes the jobid in the output filename, which makes it easy to match errors to failed jobs and saves a separate output log for each run. If you do change this option, make sure that it results in a unique name for the job in a writable directory.
- --error=filename_pattern
- If this option is used, errors from jobs will be saved in a separate file. Usually it is easier to use the default, which saves errors with the job output.
- --mail-type, --mail-user
- Because some users abused email notification with large array jobs, causing an excessive burden on the campus mail system, these options are disabled on most clusters. If you need external notification of job completion, talk to us and something can be arranged. Use them carefully on clusters that still have them enabled.
- --nodes=#
- This option is disabled on some clusters because it requires multiple node support within the code and some users inadvertently use it on jobs that only support single nodes. If you know your code works on multiple nodes and supports slurm host lists, contact us and it can be enabled for your account.
- --nodelist
- Please do not force slurm to use specific nodes. Slurm automatically picks the best node by default. Use of this option without compelling justification will cause it to be disabled. Please use feature lists (-C), or other resource requests instead. (Ask for the features you need and we'll add them.)
- --exclude=nodelist
- This option can be used if your job crashes on specific nodes; however slurm usually takes nodes offline when this occurs. If there is a problem with a node, please let us know ASAP so that it can be fixed rather than just excluding it in every job. If this feature is abused, it will be disabled.
Array jobs[edit]
Array jobs are a batch queue feature for embarrassingly parallel workloads, where each task runs independently. For instance, if you need to run the same program on 100 different input files, you could create a script called myjob.sh containing:
#!/bin/bash
myjob testcase-$SLURM_ARRAY_TASK_ID
and submit it like this:
sbatch -a 1-100 myjob.sh
The program myjob would be run repeatedly with $SLURM_ARRAY_TASK_ID replaced with the numbers 1, 2, 3, ... 100
If you wanted to skip numbers, you could do something like this:
sbatch -a 4-20:3 myjob.sh
which would run jobs with $SLURM_ARRAY_TASK_ID set to 4, 7, 10, 13, 16, 19
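The stride works like seq's start/step/stop (seq takes the arguments in a different order), so you can preview the task IDs a range will generate before submitting:

```shell
# task IDs produced by -a 4-20:3 (start 4, stop 20, step 3)
seq 4 3 20 | xargs    # prints 4 7 10 13 16 19
```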
If you have a large list of files to process that are not numbered sequentially, you could save the list of files to a file and extract it like this:
% ls datadir > filelist.txt
% wc -l filelist.txt
24532 filelist.txt
(This list has 24532 files in it)
% sbatch -a 1-24532 processfiles.sh
In your job, you can extract the name like this:
#!/bin/bash
taskfile=`sed -n "${SLURM_ARRAY_TASK_ID}p" filelist.txt`
yourprogram $taskfile
If some of these jobs failed and you want to rerun them, you can use -a and list the tasks (and subranges of tasks) separated with commas.
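You can sanity-check the sed line-extraction outside of slurm by setting the task ID yourself (the file names below are made up):

```shell
# build a mock file list, then extract line 2 the way the job script does
printf 'alpha.dat\nbeta.dat\ngamma.dat\n' > filelist.txt
SLURM_ARRAY_TASK_ID=2
taskfile=$(sed -n "${SLURM_ARRAY_TASK_ID}p" filelist.txt)
echo "$taskfile"    # prints beta.dat
```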
Environment variables[edit]
For a complete list, check the sbatch man page.
- SLURM_JOB_ID
- The ID of the job allocation.
- SLURM_ARRAY_TASK_ID
- the current task's array index
- SLURM_RESTART_COUNT
- number of times this job has been restarted and requeued; use this to detect if you need to do something to restore a previous saved state
- SLURM_CPUS_PER_TASK
- number of cpus requested per task; only set if -c/--cpus-per-task was specified
Diagnostics and error codes[edit]
Jobs may fail at different stages for various reasons.
If you get an error during job submission, either you have a syntax error in the parameters or you have asked for resources that will never be available. Ask for help if this is not obvious to you.
If your job is submitted successfully but then disappears from the queue, it probably finished (successfully or unsuccessfully) very quickly. Slurm records basic accounting data and directs application output and error messages to one or two log files (depending on the job options). If the job fails quickly and no log files are written, the most likely reason is that you started it in a directory where the log file cannot be written, or you are over your disk quota.
You can use the sacct command to check the status and post mortem statistics of a job.
The State column indicates slurm errors; sometimes the Reason column gives more detail. The ExitCode column shows an application-specific numeric error code.
$ sacct -j 123456
JobID         JobName     Partition   Account     AllocCPUS   State       ExitCode
------------  ----------  ----------  ----------  ----------  ----------  --------
123456        myjob       gpu         group3      8           OUT_OF_ME+  0:125
123456.batch  batch                   group3      8           OUT_OF_ME+  0:125
123456.exte+  extern                  group3      8           COMPLETED   0:0
or use one of the wrappers for sacct ( sacct-diag sacct-mem )
$ sacct-diag -j 123456
NodeList=c3-1 Start=2024-04-20T09:10:01 End=2024-04-20T11:47:08 Elapsed=02:37:07
*******************************************************************************
User  JobID          ExitCode  State          Reason  JobName
ssd   123456         0:125     OUT_OF_MEMORY  None    myjob
      123456.batch   0:125     OUT_OF_MEMORY          batch
      123456.extern  0:0       COMPLETED              extern
$ sacct-mem -j 123456
JobID         User      State       All  ReqMem   MaxVMSi  NodeLi  Start
------------  --------  ----------  ---  -------  -------  ------  -------------------
123456        ssd       OUT_OF_ME+  8    61.72G            c2-1    2024-04-20T09:10:01
123456.batch            OUT_OF_ME+  8             61.14G   c2-1    2024-04-20T09:10:01
123456.exte+            COMPLETED   8             0        c2-1    2024-04-20T09:10:01
Note that the MaxVMSize column is sampled, so it may not actually include the highest value.
Slurm displays the exit code as exitvalue:signal. The exit value is generally application specific, but a few common exit code meanings are listed below. (These may or may not be relevant, since the application controls this.)
value | meaning
---|---
0 | success
nonzero | failure; slurm will mark the job as FAILED
1 | general failure
2 | incorrect use of a shell builtin (bash)
125 | out of memory
126 | command cannot execute (bash)
127 | command not found (bash)
128 | invalid argument to exit (bash)
Any other exit code (and some of the above) may be application specific. This is the value passed to exit() by the application.
Some common signals that may cause an application to exit are listed here. For a complete list of signals, see man -s 7 signal.
signal | meaning
---|---
2 | equivalent of ctrl-c
6 | application detected a critical error and called abort()
8 | floating point error
7 11 | memory access error (your code is buggy)
9 15 | slurm probably killed this job; canceled by user or time expired?
53 | failed to write output file (check quota and directory permissions)
125 | out of memory
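A shell reports a child killed by signal N as exit status 128+N, which is why a job terminated with SIGTERM can show up as 143 in application logs even though slurm records it as 0:15. This can be reproduced locally:

```shell
bash -c 'kill -TERM $$'; echo $?   # 143 = 128 + 15 (SIGTERM)
bash -c 'kill -KILL $$'; echo $?   # 137 = 128 + 9  (SIGKILL)
```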