Help:Slurm

From CECS wiki

The Slurm Workload Manager (formerly known as the Simple Linux Utility for Resource Management, or SLURM) is a free and open-source job scheduler for Linux.

Software homepage
https://slurm.schedmd.com
Software availability
newer clusters
Other related software
replaces sge
Commands to run
sinfo sbatch srun scancel squeue
View online documentation
https://slurm.schedmd.com/quickstart.html
https://slurm.schedmd.com/tutorials.html
https://arcc.ist.ucf.edu/index.php/help/tutorials/job-submission-on-stokes-with-slurm
https://srcc.stanford.edu/sge-slurm-conversion
SLURM Command Option Summary (cheat sheet)
https://slurm.schedmd.com/pdfs/summary.pdf

Examples

batch example

Sample Job script:

#!/bin/bash
sleep 100

Make the script executable:

chmod +x testscript

Submit the script:

sbatch testscript

example slurm script to use gpu

save the following as (for example) testjob.slurm

#!/bin/bash
#SBATCH -p gpu 
#SBATCH --gres=gpu:1 
#SBATCH -c 4
echo CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES
python3 tens.py

then make it executable:

$ chmod +x testjob.slurm

Submit this job like this:

$ module load cuda cudnn
$ sbatch testjob.slurm

Note: you can also specify gpu type, for example: (pick one)

#SBATCH --gres=gpu:pascal:1 
#SBATCH --gres=gpu:kepler:1

Note the use of CUDA_VISIBLE_DEVICES: printing its value may help with debugging. You can use it to match the gpus your job used against the system performance graphs.

check slurm status and resource availability

sinfo
get basic node status and partition list
sinfo -O nodelist,partition,gres,features
get list of available generic resources

Example:

$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal*      up   infinite      4   idle c0-[1-4]
gpu          up   infinite      1   idle c0-0
$ sinfo -O nodelist,partition,gres,features
NODELIST            PARTITION           GRES                AVAIL_FEATURES
c0-1                normal*             (null)              zswap
c0-0                gpu                 gpu:kepler:2        gpu
$ sinfo -O partition,nodelist,cpusstate,gres

In this example, there are two kepler gpus (request one with --gres=gpu:1) available in the gpu partition (-p gpu).

As a simple example:

sbatch -p gpu --gres=gpu:1 --wrap="nvidia-smi"

or

srun -p gpu --gres=gpu:1  nvidia-smi

To check on running jobs:

 squeue

To cancel a job:

 scancel jobid

where jobid can be found in the output from squeue

To view information about completed jobs:

 sacct 

To view information about old jobs (for instance, jobs since Jan 2):

 sacct --start=0102

To view all information about a specific job (replace ### with the job number):

 sacct -j ### -o all -P | strans | less

The strans command (coupled with the sacct -P option) transposes the table to make it more readable. You can also list just the fields you want to see with sacct -o field,list, or try spformat, which separates out common fields and resizes column widths. (Also try spformat -fK to split the output into multiple small tables.)
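For example, to show only a handful of commonly useful fields (these are standard sacct field names; replace ### with the job number):

 sacct -j ### -o jobid,jobname,state,exitcode,elapsed,maxrss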

For detailed information about a job that is either currently running or waiting to run, supply its jobid:

 scontrol show job 123

interactive example

Note: interactive jobs are discouraged. Resources requested by interactive sessions are unavailable to other users until the session closes. Abuse of this will cause limits to be placed on interactive sessions.

srun --pty -I

or if you want an interactive bash shell on a node:

srun --pty -I bash

Slurm options

These can be put on the sbatch or srun command line, or added to your batch script on lines prefixed with #SBATCH.

These are a few of the more interesting options; for a complete list, check the sbatch man page.

Please note: slurm options are somewhat sensitive to order!

  • The partition option should be first ( -p gpu )
  • QOS options should be next
  • GPU requests should be after QOS
  • In srun, the command to run should be last with all slurm options before it.

Options that select resource allocation permissions (such as partition and qos) need to appear early in the option list, as in the sketch below.
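
As a rough sketch, a batch script header following this ordering might look like the following (the qos name and yourprogram are placeholders; your cluster may not use qos at all):

#!/bin/bash
#SBATCH -p gpu              # partition first
#SBATCH --qos=normal        # qos next (placeholder name)
#SBATCH --gres=gpu:1        # gpu request after qos
#SBATCH -c 4                # remaining options (cpus per task)
yourprogram input.dat       # placeholder for the program you want to run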

common options

-c #
allocate # cpus per task (default=1)
-C features
request a node with special features; run sinfo -O features for a list of available features.
--exclusive
request an exclusive node rather than sharing the node with other jobs
--mem=#
memory needed per node
--mem-per-cpu=#
provide a minimum amount of memory per cpu (most clusters default to 8G)
--cpus-per-task=#
request # cpus for the job task (long form of -c)
--nodes=#
allocate at least # nodes for the job (but see below)
-p partition
run in a specific partition instead of the default partition (list partitions with sinfo; some clusters have a gpu partition)
--wrap="command"
wrap a command in a shell script instead of specifying the script
-C 'feature|feature...'
request a node with any one of several alternative features (OR syntax); see sinfo -O features for a cluster-specific list of available features.
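
As a sketch combining several of these options on the sbatch command line (yourprogram and the resource numbers are placeholders; the partition and feature names are taken from the sinfo example above):

sbatch -p normal -c 8 --mem=16G -C zswap --wrap="yourprogram input.dat"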

gpu use

Use these slurm options to request gpus. As with other options, they can either be put on the sbatch command line or in your script.

--gres=gpu:1
request one gpu
--gres=gpu:pascal:1
request one pascal gpu (see sinfo -O gres)
-p gpu
select the gpu partition (needed on clusters with both gpu nodes and cpu only nodes, check with sinfo )

Deprecated slurm options

We do not recommend the following options because the defaults work well:

--cpus-per-gpu
DO NOT USE THIS OPTION. IT DOESN'T WORK IN THE CURRENT VERSION OF SLURM!
--output=filename_pattern
The default is 'slurm-%j.out' (or 'slurm-%A_%a.out' for array jobs), which includes the jobid in the output filename; this makes it easy to match errors to failed jobs and saves a separate output log for each run. If you do change this option, make sure that it results in a unique name for each job in a writable directory.
--error=filename_pattern
If this option is used, errors from jobs will be saved in a separate file. Usually it is easier to use the default, which saves errors with the job output.
--mail-type / --mail-user
Because some users abused mail notification with large array jobs, causing an excessive burden on the campus mail system, these options are disabled on most clusters. If you need external notification of job completion, talk to us and something can be arranged. Use them carefully on clusters that still have them enabled.
--nodes=#
This option is disabled on some clusters because it requires multiple node support within the code and some users inadvertently use it on jobs that only support single nodes. If you know your code works on multiple nodes and supports slurm host lists, contact us and it can be enabled for your account.
--nodelist
Please do not force slurm to use specific nodes. Slurm automatically picks the best node by default. Use of this option without compelling justification will cause it to be disabled. Please use feature lists (-C), or other resource requests instead. (Ask for the features you need and we'll add them.)
--exclude=nodelist
This option can be used if your job crashes on specific nodes; however slurm usually takes nodes offline when this occurs. If there is a problem with a node, please let us know ASAP so that it can be fixed rather than just excluding it in every job. If this feature is abused, it will be disabled.

Array jobs

Array jobs are a batch queue feature for embarrassingly parallel workloads, where the same program needs to run many times with different inputs. For instance, if you need to run the same program on 100 different input files, you could create a script called myjob.sh containing:

#!/bin/bash
myjob testcase-$SLURM_ARRAY_TASK_ID

and submit it like this:

 sbatch -a 1-100 myjob.sh

The program myjob would be run repeatedly, with $SLURM_ARRAY_TASK_ID set to 1, 2, 3, ... 100.

If you wanted to skip numbers, you could do something like this:

sbatch -a 4-20:3 myjob.sh

which would run jobs with $SLURM_ARRAY_TASK_ID set to 4, 7, 10, 13, 16, 19

If you have a large list of files to process that are not numbered sequentially, you can save the list of filenames to a file and index into it by task number:

% ls datadir > filelist.txt
% wc -l filelist.txt
24532 filelist.txt

(This list has 24532 files in it)

% sbatch -a 1-24532 processfiles.sh

In your job, you can extract the name like this:

#!/bin/bash
# pick out line number $SLURM_ARRAY_TASK_ID from the file list
taskfile=$(sed -n "${SLURM_ARRAY_TASK_ID}p" filelist.txt)
yourprogram "$taskfile"


If some of these jobs failed and you want to rerun them, you can use -a and list the tasks (and subranges of tasks) separated with commas.
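
For example, to rerun only tasks 7, 23, and 100 through 110 (task numbers chosen only for illustration):

 sbatch -a 7,23,100-110 processfiles.sh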

Environment variables

For a complete list, check the sbatch man page.

SLURM_JOB_ID
The ID of the job allocation.
SLURM_ARRAY_TASK_ID
the current task's array index
SLURM_RESTART_COUNT
number of times this job has been restarted and requeued; use this to detect if you need to do something to restore a previous saved state
SLURM_CPUS_PER_TASK
number of cpus allocated per task; only set if --cpus-per-task (-c) was specified
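
A common use for these variables (a sketch, assuming your program is multithreaded and honors OMP_NUM_THREADS; yourprogram is a placeholder) is to match the program's thread count to the number of cpus slurm allocated:

#!/bin/bash
#SBATCH -c 8
# SLURM_CPUS_PER_TASK is only set when -c/--cpus-per-task is given, so fall back to 1
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
yourprogram input.dat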

Diagnostics and error codes

Jobs may fail at different stages for various reasons.

If you get an error during job submission, either you have a syntax error in the parameters or you have asked for resources that will never be available. Ask for help if this is not obvious to you.

If your job is submitted successfully but then disappears from the queue, it probably finished (successfully or unsuccessfully) very quickly. Slurm records basic accounting data and directs application output and error messages to one or two log files (depending on job options). If the job fails quickly and no log files are written, the most likely reason is that you started it in a directory where the log file cannot be written, or you are over your disk quota.

You can use the sacct command to check the status and post mortem statistics of a job.

The State column indicates slurm errors; sometimes the Reason column gives more details. The ExitCode column shows an application-specific numeric error code.

$ sacct -j 123456
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
123456            myjob        gpu     group3          8 OUT_OF_ME+    0:125 
123456.batch      batch                group3          8 OUT_OF_ME+    0:125 
123456.exte+     extern                group3          8  COMPLETED      0:0 

or use one of the wrapper scripts for sacct (sacct-diag, sacct-mem):

$ sacct-diag -j 123456
NodeList=c2-1
Start=2024-04-20T09:10:01
End=2024-04-20T11:47:08
Elapsed=02:37:07

*******************************************************************************
User JobID ExitCode State Reason JobName
User    JobID         ExitC State         Reas JobName
ssd     123456        0:125 OUT_OF_MEMORY None myjob
        123456.batch  0:125 OUT_OF_MEMORY      batch 
        123456.extern 0:0   COMPLETED          extern
$ sacct-mem -j 123456
JobID        User          State All  ReqMem MaxVMSi NodeLi               Start 
------------ -------- ---------- --- ------- ------- ------ ------------------- 
123456            ssd OUT_OF_ME+   8  61.72G         c2-1   2024-04-20T09:10:01 
123456.batch          OUT_OF_ME+   8          61.14G c2-1   2024-04-20T09:10:01 
123456.exte+           COMPLETED   8               0 c2-1   2024-04-20T09:10:01 

Note that the MaxVMSize column is sampled, so it may not actually include the highest value.


Slurm displays the exit code as two numbers separated by a colon, exit:signal. The exit value is generally application specific, but a few common exit code meanings are listed below. (These may or may not be relevant, since the application controls this.)

value    meaning
0        success
nonzero  failure; slurm will mark the job as FAILED
1        general failure
2        incorrect use of a shell builtin (bash)
125      out of memory
126      command cannot execute (bash)
127      command not found (bash)
128      invalid argument to exit (bash)

Any other exit code (and some of the above) may be application specific. This is the value passed to exit() by the application.
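
For example (a sketch; yourprogram is a placeholder), if the last command in your batch script exits with a nonzero value, that value becomes the job's exit code:

#!/bin/bash
yourprogram input.dat || exit 2   # job exits with code 2 if yourprogram fails

sacct would then typically report the job as FAILED with an ExitCode of 2:0.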

Some common signals that may cause an application to exit are listed here. For a complete list of signals, see man -s 7 signal.

signal   meaning
2        SIGINT; equivalent of ctrl-c
6        SIGABRT; the application detected a critical error and called abort()
8        SIGFPE; floating point error
7, 11    SIGBUS / SIGSEGV; memory access error (your code is buggy)
9, 15    SIGKILL / SIGTERM; slurm probably killed this job; canceled by user or time expired?
53       failed to write output file (check quota and directory permissions)
125      out of memory