Help:SGE
Sun Grid Engine (SGE) is an open source job queueing system for clusters.
NOTE: SGE is being replaced with Slurm as clusters are upgraded.
- Software availability
- all rocks clusters
- Other related software
- help:rocks cluster, help:ganglia
- View online documentation
- see man pages for individual commands, especially man sge_intro and man qsub
- SGE wiki
- Rocks SGE Roll Documentation
- http://gridengine.info/
- Location of example files
- Qsub and MPI example
Commands
- User interface / job submission
- qsub qmon qmake
- System info
- qhost qacct
- Job info
- qstat
- Job manipulation
- qdel qalter qresub qhold qmod qrls
- Admin
- qmon qconf qquota
- misc
- mpi-selector-menu
qsub | submit jobs
qalter | modify parameters of already submitted jobs
qmon | graphical interface to gridengine
qstat | show status (of hosts, jobs, queues, etc.)
qdel | delete jobs
Windows users
Special note for Windows users!
If you edit your SGE script on Windows, the line-ending convention may be wrong. The script might work anyway, but it may produce syntax errors in strange places. Windows editors also tend not to terminate the last line of the file, causing SGE to IGNORE that last line.
To get around these problems:
- If necessary, use dos2unix file.sge to fix the line endings after editing in windows.
- Make sure there is a blank line at the end of the file when you edit in windows.
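If dos2unix is not installed, tr can strip the carriage returns instead. A quick sketch of detecting and fixing such a file (the file name here is made up):

```shell
# Simulate a script saved by a Windows editor: CRLF line endings,
# and no newline terminating the last line
printf 'echo hello\r\necho world\r' > winfile.sge

# Count the lines that carry a carriage return
crlf_lines=$(grep -c "$(printf '\r')" winfile.sge)
echo "lines with CRLF: $crlf_lines"

# Strip the carriage returns (equivalent to dos2unix for this purpose)
tr -d '\r' < winfile.sge > fixed.sge
fixed_lines=$(grep -c "$(printf '\r')" fixed.sge || true)
echo "lines with CRLF after fix: $fixed_lines"

# tr does not add the missing final newline; append one so SGE
# does not ignore the last line
echo >> fixed.sge
```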
Example use
For a complete example, see qsub and MPI example.
See the man pages for qsub, qdel, qmod, and qstat for a complete list of options. A few are listed in the sections below.
- submit a job
% qsub -cwd testjob.csh
Your job 3412 ("testjob.csh") has been submitted
- list jobs
% qstat
Job-ID  name         user  state  queue                     slots
3412    testjob.csh  ssd   r      all.q@compute-0-11.local  1
(note: output edited for clarity)
- list nodes your jobs are running on
% qstat -f -ne
queuename                 qtype  used/tot.  load_avg  arch        states
----------------------------------------------------------------------------
all.q@compute-1-11.local  BIP    1/2        0.00      lx26-amd64
   3412 0.60000 sleep      ssd          r     08/14/2009 16:00:25     1
- suspend a job
% qmod -sj 3412
Note: suspended jobs remain in memory. Jobs that use a checkpointing method (such as fluent_ckpt for fluent) will also checkpoint before suspending. (Ask for help to get additional checkpointing methods installed.) qmod -usj will unsuspend a job.
- kill jobs
% qdel 3412
ssd has registered the job 3412 for deletion
Additional sample commands:
- qstat -f -ne
- show all nodes your jobs are running on
- qhost -F mem_free,swap_free,scratch
- show memory and scratch space left on each node
More examples
To submit an executable without a startup script:
qsub -cwd -b y myjob
For a complete example, see qsub and MPI example.
Using a startup script
Embedding your job in a script lets you move SGE command-line options into the script so you don't have to retype them every time. It also lets you add commands to set up the job environment before it runs, use SGE environment variables to adjust the job, and clean up after the job finishes. For example:
Use 4 CPUs with MPICH (note: NOT MPICH2):
#$ -pe mpich 4
#$ -cwd
mpirun -np $NSLOTS myjob
If you saved this as myjobscript then you could submit it with qsub myjobscript
A more complex script might look like this (the comments on the right explain each line):
#$ -cwd                  # Lines starting with #$ are options to qsub
#$ -l mem_free=3G        # Pick nodes with at least 3G of free memory.
#$ -q *@compute-1-*      # (optional) limit this job to nodes in rack 1
fluent -sge -g 3d <<EOF  # start fluent; <<EOF feeds the lines from here until EOF to it as input
file/read-case-data test.cas.gz
solve/iterate 1000
file/write-case-data final.gz
EOF
# Additional shell commands can go after EOF
(see mmae:help:fluent for more details on this script.)
Array jobs
Array jobs are an SGE feature for embarrassingly parallel tasks where parallelization is trivial. For instance, if you need to run the same program on 100 different input files, you could create a script called myjob.sh containing:
myjob testcase-$SGE_TASK_ID.input
and submit it like this:
qsub -t 1-100 myjob.sh
The program myjob would be run repeatedly with $SGE_TASK_ID replaced with the numbers 1, 2, 3, ... 100
Note: SGE does strange things when both -t and -pe are used together. (Ask for help if you can't get it to do what you want.)
If you wanted to skip numbers, you could do something like this:
qsub -t 4-20:3 myjob.sh
which would run jobs with $SGE_TASK_ID set to 4, 7, 10, 13, 16, 19
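You can preview which task IDs a stride will generate with seq, using the same start, step, and end values:

```shell
# Same sequence as qsub -t 4-20:3 generates:
# start at 4, step by 3, never exceed 20
seq 4 3 20
```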
If you have a large list of files to process, you could save the list of files to a file and extract it like this:
% ls datadir > filelist.txt
% wc -l filelist.txt
24532 filelist.txt
(This list has 24532 files in it)
% qsub -t 1-24532 processfiles.sge
In your job, you can extract the name like this:
#$ -cwd
set taskfile=`sed -n "${SGE_TASK_ID}p" filelist.txt`
yourprogram $taskfile
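The sed line-extraction trick can be tried outside SGE by setting the task number by hand; the file names below are made up:

```shell
# A stand-in for filelist.txt with three hypothetical data files
printf 'alpha.dat\nbeta.dat\ngamma.dat\n' > filelist.txt

# Pretend SGE assigned this task number 2
SGE_TASK_ID=2

# Print only line number $SGE_TASK_ID, as the job script does
taskfile=$(sed -n "${SGE_TASK_ID}p" filelist.txt)
echo "would process: $taskfile"
```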
Job status checks
- qstat
- check the status of your current pending and running jobs
- qstat -s z
- show recently completed jobs
- qacct -d 1 -o USER -j
- show jobs (-j) owned by user (-o user) that completed in the last day (-d 1)
SGE's built in reporting tools are a bit cumbersome. If there's some specific information you want to know about past or current jobs, drop me a note and I'll write a report generator for you.
Debugging failures
After you submit your job, you can use qstat (see above) to check on the status of your job.
% qstat
job-ID  prior    name   user  state  submit/start at      queue  slots  ja-task-ID
-----------------------------------------------------------------------------------------------------------------
   206  0.00000  sleep  ssd   qw     09/26/2009 10:00:50             5
Note that the STATE column shows what your job is doing:
- qw
- job is waiting to run; it may take 15 seconds for new jobs to be noticed. Jobs will then wait until resources are available.
- r
- job is running
- Eqw
- something is wrong with the job
- hs
- job is suspended and halted
- d
- job is being deleted
If your job is in an error state (E), use qstat to figure out why:
% qstat
job-ID  prior    name   user  state  submit/start at      queue  slots  ja-task-ID
-----------------------------------------------------------------------------------------------------------------
   207  0.55500  sleep  ssd   Eqw    09/26/2009 10:05:26             1
% qstat -explain E -j 207 | less
==============================================================
job_number:                 207
cwd:                        /export/home/ssd
job_name:                   sleep
error reason    1:          09/26/2009 10:05:32 [500:3147]: error: can't chdir to /export/home/ssd: No such file or di
(Note: above output is abbreviated for clarity)
Look for the error reason; note that there may be many other warnings or errors that may or may not be helpful.
The above error indicates that the target directory doesn't exist on the compute nodes. (Note: this specific error occurs because /export exists only on the head node; do not reference it in your jobs, use the shorter paths that don't include it.)
Also, SGE saves the output from the job in log files named after the job:
% ls -lt sleep*
-rw-r--r--  1 ssd ssd 0 Sep 26 09:10 sleep.po206
-rw-r--r--  1 ssd ssd 0 Sep 26 09:10 sleep.pe206
-rw-r--r--  1 ssd ssd 0 Sep 26 09:10 sleep.o206
-rw-r--r--  1 ssd ssd 0 Sep 26 09:10 sleep.e206
-rw-r--r--  1 ssd ssd 0 Sep 26 10:04 sleep.o208
-rw-r--r--  1 ssd ssd 0 Sep 26 10:04 sleep.e208
The *.p* files contain output of the commands needed to start parallel jobs.
The *.o* and *.po* files contain the standard output of the job. The *.e* and *.pe* files contain the standard error of the job. The qsub -j y option merges the *o* and *e* files.
These files are safe to delete if you don't need the console output from your job. (Note: my files above are all zero size, since the job output nothing to the screen and had no errors. Job 207 has no output at all, because it failed to run.)
If you can't figure out the cause of the errors, or your job sits in the queue without running for longer than you expect, ask for help.
Summary of interesting qsub options
(Read the man page for qsub for a complete list.)
- -cwd
- run the job in the same directory that qsub was run from instead of the home directory
- -V
- copy environment variables (including mpi paths, etc.) from the environment qsub was run in (NOTE: you may need to unset DISPLAY)
- -v var
- copy the value of a single environment variable from the current environment into the job (suggest: LD_LIBRARY_PATH if mpi needs it )
- -v var=value
- set an environment variable in the job
- -j y
- combine error output stream with the normal output stream instead of making two output files per job
- -pe penv cpurange
- select the desired parallel environment (see below) and number of processors
- -S
- change the default shell for interpreted scripts (default is csh)
- -P short
- suggest that a job be placed in the short queue; jobs in the short queue must complete in 24 hours or be killed, but if the cluster is full, they may get a higher priority and have nodes reserved for them
- -l limit=value
- specify a resource limit (see below)
- -R y
- request resources be reserved for this job (helps with scheduling large jobs when smaller jobs are also in the queue)
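Several of these options are often combined in one job script header; a sketch follows (the job command, slot count, and memory value here are only illustrative):

```
#$ -cwd            # run from the submission directory
#$ -j y            # one combined output/error log per job
#$ -pe mpich 8     # parallel environment and slot count
#$ -l mem_free=2G  # only pick nodes with 2G of memory free
#$ -R y            # reserve resources so this large job isn't starved
mpirun -np $NSLOTS ./myjob
```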
Parallel environments
Parallel environments are used to do preparation, such as building mpi host files and starting mpi daemons.
Note that some of these are local customizations not available on all machines. Use qconf -spl to list all parallel environments on the current cluster. Ask if there's one you want here that isn't in that list and it can be installed.
All platforms:
- mpich
- works best with MPICH version 1
- orte
- (rocks 5.1 only) OpenRTE / OpenMPI tightly coupled
Some platforms:
- lam
- (rocks 5.0 only)
- mpich2_mpd
- works best with MPICH version 2
- mpi
- mpich1 loose integration (deprecated)
- fluent_pe
- special parallel environment for fluent (along with fluent_ckpt)
Special variations of the above (replace * with one of the above) include:
- *-one
- tuned for jobs that can't span nodes
- *-split
- allows jobs to span nodes on systems where the default is to not allow it
Resource limits and requests
By default, jobs are allowed unlimited resources unless limited by the user. The short queue request includes a 24 hour real time limit. A complete list of resources can be found in the SGE documentation. (See man pages for complex, queue_conf and other pages; a complete list of resources can be found with qconf -sc) Resource limits are specified with the -l limit=value option.
Note that some resources are per job, some are per node, and some are per cpu / thread (as indicated here). Also, some of these set limits, and others request nodes that have the specified resources. For example:
- -l h_vmem=3G
- kill the job if it tries to use more than 3G of ram per thread
- -l mem_free=3G
- only pick nodes for the job if the node has at least 3G of ram free
A few common limits are:
- s_rt
- soft real time: send a signal after this time expires
- h_rt
- hard real time: kill job after this time
- s_cpu h_cpu
- soft and hard cpu limits; total cpu time for job
- mem_free
- request nodes with at least this much memory free (ex: -l mem_free=3G )
- s_vmem h_vmem
- the job receives a signal (s_vmem) or is killed (h_vmem) if its combined virtual memory use exceeds this limit multiplied by the number of slots
- scratch
- request nodes with at least this much free space in /state/partition1 ; (ex: -l scratch=40G ) (on hilbert, euler only, or ask if you need this)
- virtual_free
- request nodes with at least this much virtual memory free
- exclusive
- request an entire node (on select clusters, may help matlab)
- gpu
- request a gpu unit
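Several resource options can be combined on one command line; for example (the values and script name below are only illustrative):

```
# Kill after 12 hours of wall-clock time; only place the job on nodes
# with at least 3G of memory and 40G of scratch space free
qsub -cwd -l h_rt=12:00:00 -l mem_free=3G -l scratch=40G myjob.sge
```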
Checkpoint and restart
A checkpointable job can save its state so that it can be resumed later. SGE currently requires application support for checkpointing to work. Applications typically support the following types of checkpointing:
- passive checkpointing
- The application checkpoints on its own periodically.
- active checkpointing
- SGE tells the application to checkpoint
Checkpointing can be activated by adding qsub options to your start script, for example:
#$ -cwd
#$ -ckpt caffe_ckpt -c 2:00:00 -r y
#$ -P short
Options are:
- -ckpt caffe_ckpt
- Use the caffe checkpointing method (see below)
- -c hh:mm:ss
- Checkpoint at the specified interval
- -r y
- This job can be restarted after it is checkpointed or if it aborts (see Exit codes below)
- -P short
- Allow this job to use the short queue which has a maximum run time limit
Each cluster has a different configured minimum checkpoint interval and maximum time for the short queue.
You can get a list of the configured checkpoint methods on a cluster with the command
qconf -sckptl
If you need additional checkpoint methods, please ask us to add what you need. Any job can use any configured checkpoint method, but some methods are known to work with specific applications.
Known checkpoint methods are:
- caffe_ckpt
- send a ctrl-c to the job when it needs to be checkpointed
- fluent_ckpt
- methods specific to the fluent application
- lsdyna_ckpt
- methods specific to the lsdyna application
If you include the restart option and the job is suspended, the job may be migrated to another machine, or others may be allowed to run their jobs before your job is restarted.
By specifying checkpointing, restart, and the short queue, your jobs may be given more resources than would normally be available: jobs in the short queue are terminated at the end of the run-time limit, allowing others to run their jobs before yours is restarted.
You can manually force a checkpoint and restart by using qmod -sj to suspend the job.
Choosing nodes
Some clusters are non-homogeneous and have nodes with special significance (i.e., newer, upgraded memory, special hardware, etc.). You can request that SGE pick nodes from a specific set of nodes by specifying them with the -q option. Note that this must be quoted if used from the command line. Examples:
- by rack
- *@compute-1-*
- by group
- *@@group1
You can get a list of groups with the command
qconf -shgrpl
or the members of a particular group with qconf -shgrp @group1. For example, use -q '*@@group1' on the command line, or in the batch script:
#$ -q *@@group1
You can also check the resources available on nodes by clicking on Physical View in ganglia
Exclusive node jobs
If a job needs exclusive access to a node (i.e., parallel matlab jobs), you can request it with
-l exclusive
This will cause the job to run on an empty node by itself. No parallel environment request is needed. Note that you may also need to request a reservation (-R y) to prevent starvation by other non-exclusive jobs.
Environment variables
SGE defines a few environment variables that may be useful in your sge job scripts.
- $JOB_NAME
- name assigned to this job (either name of the script, or value of the qsub -N option)
- $JOB_ID
- a unique number identifying this job
- $NSLOTS
- number of cpu slots assigned to this job
- $NHOSTS
- number of nodes assigned to this job (will be LOWER than $NSLOTS if there are multiple cpus per node)
- $TMPDIR
- directory containing job description scratch files (DO NOT PUT LARGE DATA FILES HERE)
- $TMPDIR/machines
- (pe's mpi mpich mpich2 only) machine file generated by the parallel environment specific to the version of mpi specified
- $PE_HOSTFILE
- SGE generated machine file (see man sge_pe under heading $pe_hostfile for complete format description)
- one host per line, columns are: hostname #cpus queue-name
- $SGE_TASK_ID
- (array jobs only) index of the current task within the array job
Additional variables are listed in the qsub man page.
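The $PE_HOSTFILE format can be exercised without SGE by writing a file by hand. The host names below are made up, and real files may carry additional columns (see man sge_pe):

```shell
# A stand-in for $PE_HOSTFILE: hostname, slot count, queue name
cat > pe_hostfile <<'EOF'
compute-0-1.local 2 all.q@compute-0-1.local
compute-0-2.local 4 all.q@compute-0-2.local
EOF

# Sum the slot counts; in a real job this should match $NSLOTS
total_slots=$(awk '{ sum += $2 } END { print sum }' pe_hostfile)
echo "total slots: $total_slots"
```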
Exit codes
SGE jobs can exit with an error code to tell the queue system what to do next. These are documented in the man page for sge_shepherd.
- 0
- Success (no error)
- 1
- General failure exit
- 99
- Retry this job if allowed
- 100
- Retry job if allowed, but don't enable dependent tasks
- other
- Other errors
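The retry convention can be sketched with an ordinary shell function; the marker file and the "work" here are hypothetical. In a real job script you would finish with exit "$status" so SGE sees the code:

```shell
# Make sure the hypothetical marker is absent for this demonstration
rm -f input-ready.flag

# Hypothetical job body: return 99 so SGE would re-queue the job when
# a staging marker file is not present yet
run_job() {
    if [ ! -f input-ready.flag ]; then
        echo "input not staged yet; requesting a retry"
        return 99
    fi
    echo "running the real work"
    return 0
}

run_job
status=$?
echo "would exit with: $status"
```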
Additional notes
- Note that qsub runs the job from your home directory by default. Specify the -cwd option to run it from the current directory.
- Overview pages showing use of Sun Grid Engine in other cluster environments - Includes sample submit scripts
- Basic Overview: Basement Sun Grid Engine Quick Start
- More Details: Using MPICH2 on Merlin3 @ the Paul Scherrer Institute
- SGE seems to force script jobs (which might contain SGE options) to run as csh scripts unless given the -S /bin/sh option, so complicated shell script constructs must use the csh syntax or include this option. Alternately, if your shell script uses sh (bourne shell) features (or some other interpreter other than csh), and you don't need SGE options read from the script itself, you can use the -b y option to execute the script with the exec system call instead of csh.
Debugging jobs
- If your job fails to start (stays in qw state for more than 20 seconds) you can use qstat -j to examine why it is not being started. If the cluster is not full and it stays this way for more than 5 minutes, please contact us.
- SGE by default saves output from the job in logs that end in the job number. You should examine these logs for clues as to why your job may have aborted. If you want to watch these logs as your job runs, you can use tail -f *#### replacing #### with the job number. (ctrl-c will exit this without disturbing your job.)
- If there are no errors and your job seems to complete without actually running, check the syntax of your start script. Some editors in windows will leave the last line of the file incomplete, which will cause it to be ignored. To be sure this does not occur, always add a blank line at the end of your script after editing it in windows.
Output buffering
When programs write their output to a file, it is block buffered by default, meaning that no output will appear until a full block of data has been written. If there is a large amount of output, this is significantly more efficient.
If your program outputs data slowly, and you want to view the output file as it progresses (with tail -f above or some other method), you may want to make it line buffered instead by prefixing the command with stdbuf -o L
If your script looks like this:
#$ -cwd
./mycode
change it to this:
#$ -cwd
stdbuf -o L ./mycode
Application specific notes
Applications with their own page
See also:
Ansys CFX
#$ -l mem_free=3G
#$ -cwd
#$ -pe mpich 4
set hosts=`cat $TMPDIR/machines | tr \\010\\012 ,,`
echo $hosts
/share/apps/ansys_inc/v130/CFX/bin/cfx5solve -batch -definition inputfile.def -parallel -par-dist $hosts -start-method "MPICH Distributed Parallel"
Other interesting options (Check with CFX documentation):
- -initial filename.res
Autopartition options:
- -partition $NSLOTS
- -parfile-save filename
- -parfile-read FILENAME
OpenFOAM
(path is for ariel and euler)
#$ -S /bin/bash
#$ -cwd
#$ -l mem_free=3G
#$ -pe mpich 4
source /share/apps/OpenFOAM/OpenFOAM-1.7.1/etc/bashrc
decomposePar
mpirun -np $NSLOTS -bynode DATADIR -parallel
reconstructPar -latestTime
Quantum ESPRESSO
This doesn't use ATLAS local threads, but MPI does seem to work. Note that more than 4 cpus can be used.
#$ -pe mpich 4
#$ -S /bin/sh
export TMP_DIR PSEUDO_DIR BIN_DIR
ESPRESSO=/share/apps/espresso
BIN_DIR=$ESPRESSO/bin
PSEUDO_DIR=$ESPRESSO/pseudo
TMP_DIR=/state/partition1/g-$USER.$JOB_ID
mkdir -p $TMP_DIR
mpirun -np $NSLOTS -machinefile $TMPDIR/machines $BIN_DIR/pw.x < $1.in > $1:r.out.$JOB_ID
# maybe run other output processing here, or save things from TMP_DIR
rm -rf $TMP_DIR
Gaussian
- http://wiki.cse.ucdavis.edu/support:hpc:software:gaussian
- http://www.gaussian.com/g_tech/g_ur/g09help.htm
- the %chk option causes Gaussian to checkpoint
- it is (probably) restartable from the checkpoint with no changes to the job script as long as the job script doesn't mess with the checkpoint file
- there is no way for SGE to trigger a checkpoint or exit, and restarting will probably fail if the job is killed while it is checkpointing (so stopping gaussian is dangerous)
- restarting is untested
- Gaussian apparently doesn't support MPI hostfiles, so you must force it not to be split across hosts with
#$ -pe mpich-one 4
- A suggested (but untested) start for your Gaussian job with scratch files should probably be something like
scratch=/state/partition1/g-$USER.$JOB_ID
mkdir -p $scratch
g03 <<EOF
%RWF=$scratch
%NOSAVE
...
EOF
rmdir $scratch
The %NOSAVE option should cause Gaussian to delete files mentioned before %NOSAVE on successful job completion.
- Note: saving scratch files to your home directory is a very heavy burden on the system and practically eliminates the advantages of a parallel cluster. Jobs caught doing this will most likely be killed. If you need lots of scratch space, use the -l scratch=40G option (or similar) to request nodes with a specific amount of space. If you need more than this, ask for help.
Python scripts
Please note that you can't run python scripts directly. If you try, SGE will try to interpret the python script using csh. Instead, create a separate sge script to run python. For example, to run a python script as an array job, use these two files:
- test.sge
#$ -cwd
#$ -t 2-5
module load opt-python
python test.py $SGE_TASK_ID
- test.py
#!/usr/bin/python
import sys
print "taskid=" + sys.argv[1]
GPU CUDA applications
- use -l gpu=1 to request one gpu for your job
- load the cuda environment in your script
- Note that some applications may need additional modules. cudnn might be included with cuda.
See also pages for individual software packages, such as those in Category:Deep learning software.
gpu.sge:
#$ -cwd
#$ -l gpu=1
module load cuda opencv caffe opt-python theano-force-gpu
caffe.bin train --solver=solver.prototxt
Note: If you use theano, you might want to add this near the top of your script to force theano to use the gpu and use python 2.7:
module load opt-python theano-force-gpu
Graphical environment in batch mode
Some jobs need non-interactive graphical interface access to function correctly. For example, STAR CCM+ and Fluent both want access to the screen to generate images and movies from their simulations.
Add the following near the top of your batch job to create a virtual graphical environment:
eval `vfbstartx-csh`
(Note: if you are not using copy/paste, those are back ticks above.)
This works by starting Xvfb to create a virtual graphical terminal and initializing the current environment to use it. Xvfb should shut down and exit after the first graphical command uses it.
If you need this and it does not work, let us know. It is possible that vfbstartx-csh or Xvfb is not installed on some clusters.