Help:SGE

Sun Grid Engine (SGE) is an open source job queueing system for clusters.

NOTE: SGE is being replaced with Slurm as clusters are upgraded.

Software availability
all Rocks clusters
Other related software
help:rocks cluster, help:ganglia
View online documentation
see man pages for individual commands, especially man sge_intro and man qsub
SGE wiki
Rocks SGE Roll Documentation
http://gridengine.info/
Location of example files
Qsub and MPI example

Commands[edit]

User interface / job submission
qsub qmon qmake
System info
qhost qacct
Job info
qstat
Job manipulation
qdel qalter qresub qhold qmod qrls
Admin
qmon qconf qquota
misc
mpi-selector-menu
qsub submit jobs
qalter modify parameters of already submitted jobs
qmon graphical interface to gridengine
qstat show status (of hosts, jobs, queues, etc.)
qdel delete jobs

Windows users[edit]

Special note for Windows users!

If you edit your SGE script in Windows, the line ending convention may be wrong. The script might work anyway, but it may give syntax errors in strange places. Also, Windows editors tend not to terminate the last line in the file, causing SGE to IGNORE that last line.

To get around these problems:

  • If necessary, use dos2unix file.sge to fix the line endings after editing in Windows.
  • Make sure there is a blank line at the end of the file when you edit in Windows.

Example use[edit]

For a complete example, see qsub and MPI example.

See man pages for qsub qdel qmod and qstat for a complete list of options. A few are listed in the following sections here.

  • submit a job
% qsub -cwd testjob.csh
Your job 3412 ("testjob.csh") has been submitted
  • list jobs
% qstat
Job-ID  name        user   state queue                     slots 
  3412  testjob.csh ssd      r   all.q@compute-0-11.local   1

(note: output edited for clarity)

  • list nodes your jobs are running on
% qstat -f -ne
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q@compute-1-11.local       BIP   1/2       0.00     lx26-amd64
  3412 0.60000 sleep      ssd          r     08/14/2009 16:00:25     1
  • suspend a job
% qmod -sj 3412

Note: suspended jobs remain in memory. Jobs that use a checkpointing method (such as fluent_ckpt for fluent) will also checkpoint before suspending. (Ask for help to get additional checkpointing methods installed.) qmod -usj will unsuspend a job.

  • kill jobs
% qdel 3412
ssd has registered the job 3412 for deletion


Additional sample commands:

qstat -f -ne
show all nodes your jobs are running on
qhost -F mem_free,swap_free,scratch
show memory and scratch space left on each node

more examples[edit]

To submit an executable without a startup script:

 qsub -cwd -b y myjob

For a complete example, see qsub and MPI example.

using a start up script[edit]

Embedding your job in a script allows you to move SGE command line options into the script so you don't have to retype them every time, and also allows you to add commands to set up the job environment before it runs, use SGE environment variables to adjust the job, and clean up after the job when it finishes. For example:

Use 4 cpus with mpich: (NOTE: NOT MPICH2)

#$ -pe mpich 4 -cwd
mpirun -np $NSLOTS myjob

If you saved this as myjobscript then you could submit it with qsub myjobscript

A more complex script might look like this:

# Lines starting with #$ are options to qsub
#$ -cwd
# Pick nodes with at least 3G of free memory.
#$ -l mem_free=3G
# (optional) limit this job to nodes in rack 1
#$ -q *@compute-1-*
# start fluent in batch mode; <<EOF feeds the lines up to EOF to fluent as input
fluent -sge -g 3d <<EOF
file/read-case-data test.cas.gz
solve/iterate 1000
file/write-case-data final.gz
EOF
# additional shell commands can go after the EOF line

(see mmae:help:fluent for more details on this script.)

Array jobs[edit]

Array jobs are an SGE feature for embarrassingly parallel tasks where parallelization is trivial. For instance, if you need to run the same program on 100 different input files, you could create a script called myjob.sh containing:

myjob testcase-$SGE_TASK_ID.input

and submit it like this:

 qsub -t 1-100 myjob.sh

The program myjob would be run repeatedly with $SGE_TASK_ID replaced with the numbers 1, 2, 3, ... 100

Note: SGE does strange things when both -t and -pe are used together. (Ask for help if you can't get it to do what you want.)

If you wanted to skip numbers, you could do something like this:

qsub -t 4-20:3 myjob.sh

which would run jobs with $SGE_TASK_ID set to 4, 7, 10, 13, 16, 19

If you have a large list of files to process, you could save the list of files to a file and extract it like this:

% ls datadir > filelist.txt
% wc -l filelist.txt
24532 filelist.txt

(This list has 24532 files in it)

% qsub -t 1-24532 processfiles.sge 

In your job, you can extract the name like this:

#$ -cwd
set taskfile=`sed -n "${SGE_TASK_ID}p" filelist.txt`
yourprogram $taskfile

Job status checks[edit]

qstat
check the status of your current pending and running jobs
qstat -s z
show recently completed jobs
qacct -d 1 -o USER -j
show jobs (-j) owned by user (-o user) that completed in the last day (-d 1)
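
For example, to see the full accounting record (run time, memory use, exit status) for a finished job, give qacct the job number (3412 is the example job from above):

qacct -j 3412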

SGE's built in reporting tools are a bit cumbersome. If there's some specific information you want to know about past or current jobs, drop me a note and I'll write a report generator for you.

Debugging failures[edit]

After you submit your job, you can use qstat (see above) to check on the status of your job.

% qstat
job-ID  prior   name       user         state submit/start at     queue             slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
   206 0.00000 sleep      ssd          qw    09/26/2009 10:00:50                     5

Note that the STATE column shows what your job is doing:

qw
job is waiting to run; it may take 15 seconds for new jobs to be noticed. Jobs will then wait until resources are available.
r
job is running
Eqw
something is wrong with the job
hs
job is held or suspended
d
job is being deleted

If your job is in an error state (E), you should use qstat to figure out why:

% qstat
job-ID  prior   name       user         state submit/start at     queue             slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
   207 0.55500 sleep      ssd          Eqw   09/26/2009 10:05:26                     1
% qstat -explain E -j 207 | less
==============================================================
job_number:                 207
cwd:                        /export/home/ssd
job_name:                   sleep
error reason    1:          09/26/2009 10:05:32 [500:3147]: error: can't chdir to /export/home/ssd: No such file or di

(Note: above output is abbreviated for clarity)

Look for the error reason; note that there may be many other warnings or errors that may or may not be helpful.

The above error indicates that the target directory doesn't exist on the compute nodes. (Note: this specific error occurs because /export exists only on the head node; do not reference it in your jobs, and use the shorter paths that don't include it.)

Also, SGE saves the output from the job in log files named after the job:

% ls -lt sleep*
-rw-r--r-- 1 ssd ssd 0 Sep 26 09:10 sleep.po206
-rw-r--r-- 1 ssd ssd 0 Sep 26 09:10 sleep.pe206
-rw-r--r-- 1 ssd ssd 0 Sep 26 09:10 sleep.o206
-rw-r--r-- 1 ssd ssd 0 Sep 26 09:10 sleep.e206
-rw-r--r-- 1 ssd ssd 0 Sep 26 10:04 sleep.o208
-rw-r--r-- 1 ssd ssd 0 Sep 26 10:04 sleep.e208

The *.po* and *.pe* files contain the output of the commands used to start and stop the parallel environment.

The *.o* files contain the standard output of the job and the *.e* files contain its standard error. The qsub -j y option merges the *o* and *e* files.

These files are safe to delete if you don't need the console output from your job. (Note: my files above are all zero size, since the job output nothing to the screen and had no errors. Job 207 has no output at all, because it failed to run.)

If you can't figure out the cause of the errors, or your job sits in the queue without running for longer than you expect, ask for help.

Summary of interesting qsub options[edit]

(Read the man page for qsub for a complete list.)

-cwd
run the job in the same directory that qsub was run from instead of the home directory
-V
copy environment variables (including mpi paths, etc.) from the environment qsub was run in (NOTE: you may need to unset DISPLAY)
-v var
copy the value of a single environment variable from the current environment into the job (suggest: LD_LIBRARY_PATH if mpi needs it )
-v var=value
set an environment variable in the job
-j y
combine error output stream with the normal output stream instead of making two output files per job
-pe penv cpurange
select the desired parallel environment (see below) and number of processors
-S
change the default shell for interpreted scripts (default is csh)
-P short
suggest that a job be placed in the short queue; jobs in the short queue must complete in 24 hours or be killed, but if the cluster is full, they may get a higher priority and have nodes reserved for them
-l limit=value
specify a resource limit (see below)
-R y
request resources be reserved for this job (helps with scheduling large jobs when smaller jobs are also in the queue)
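
A sketch of how several of these options might combine in one job script (mysolver and the slot count are only placeholders; pick a parallel environment that exists on your cluster):

#$ -cwd
#$ -j y
#$ -V
#$ -pe mpich 8
#$ -R y
# run the MPI program on however many slots SGE granted
mpirun -np $NSLOTS ./mysolver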

Parallel environments[edit]

Parallel environments are used to do preparation, such as building mpi host files and starting mpi daemons.

Note that some of these are local customizations not available on all machines. Use qconf -spl to list all parallel environments on the current cluster. Ask if there's one you want here that isn't in that list and it can be installed.

All platforms:

mpich
works best with MPICH version 1
orte
(rocks 5.1 only) OpenRTE / OpenMPI tightly coupled

Some platforms:

lam
(rocks 5.0 only)
mpich2_mpd
works best with MPICH version 2
mpi
mpich1 loose integration (deprecated)
fluent_pe
special parallel environment for fluent (along with fluent_ckpt)

Special variations of the above (replace * with one of the above) include:

*-one
tuned for jobs that can't span nodes
*-split
allows jobs to span nodes on systems where the default is to not allow it
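
For example, to list what is configured on the current cluster and keep a 4-slot mpich job on a single node (a sketch; the mpich-one variant may not exist on every cluster):

qconf -spl
qsub -cwd -pe mpich-one 4 myjobscript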

Resource limits and requests[edit]

By default, jobs are allowed unlimited resources unless limited by the user. The short queue request includes a 24 hour real time limit. A complete list of resources can be found in the SGE documentation. (See man pages for complex, queue_conf and other pages; a complete list of resources can be found with qconf -sc) Resource limits are specified with the -l limit=value option.

Note that some resources are per job, some are per node, and some are per cpu / thread (as indicated here). Also, some of these set limits, and others request nodes that have the specified resources. For example:

-l h_vmem=3G
kill the job if it tries to use more than 3G of ram per thread
-l mem_free=3G
only pick nodes for the job if the node has at least 3G of ram free

A few common limits are:

s_rt
soft real time: send a signal after this time expires
h_rt
hard real time: kill job after this time
s_cpu h_cpu
soft and hard cpu limits; total cpu time for job
mem_free
request nodes with at least this much memory free (ex: -l mem_free=3G )
s_vmem h_vmem
job will get a signal (s_vmem) or be killed (h_vmem) if its combined virtual memory use exceeds this value multiplied by the number of slots
scratch
request nodes with at least this much free space in /state/partition1 ; (ex: -l scratch=40G ) (on hilbert, euler only, or ask if you need this)
virtual_free
request nodes with at least this much virtual memory free
exclusive
request an entire node (on select clusters, may help matlab)
gpu
request a gpu unit
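
A sketch of a job script combining several of these requests (the values and mysolver are only illustrative):

#$ -cwd
# only pick nodes with at least 3G of memory free
#$ -l mem_free=3G
# kill the job if it runs longer than 12 hours of real time
#$ -l h_rt=12:00:00
./mysolver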

Checkpoint and restart[edit]

A checkpointable job can save its state so that it can be resumed later. SGE currently requires application support for checkpointing to work. Applications typically support the following types of checkpointing:

passive checkpointing
The application checkpoints on its own periodically.
active checkpointing
SGE tells the application to checkpoint

Checkpointing can be activated by adding qsub options to your start script, for example:

#$ -cwd
#$ -ckpt caffe_ckpt -c 2:00:00 -r y 
#$ -P short

Options are:

-ckpt caffe_ckpt
Use the caffe checkpointing method (see below)
-c hh:mm:ss
Checkpoint at the specified interval
-r y
This job can be restarted after it is checkpointed or if it aborts (see Exit codes below)
-P short
Allow this job to use the short queue which has a maximum run time limit

Each cluster has a different configured minimum checkpoint interval and maximum time for the short queue.

You can get a list of the configured checkpoint methods on a cluster with the command

qconf -sckptl

If you need additional checkpoint methods, please ask us to add what you need. Any job can use any configured checkpoint method, but some methods are known to work with specific applications.

Known checkpoint methods are:

caffe_ckpt
send a ctrl-c to the job when it needs to be checkpointed
fluent_ckpt
methods specific to the fluent application
lsdyna_ckpt
methods specific to the lsdyna application


If you include the restart option and the job is suspended, the job may be migrated to another machine, or other users' jobs may run before your job is restarted.

By specifying checkpointing, restart, and the short queue, your job may be given more resources than would normally be available, because jobs in the short queue are terminated at the end of the run time limit, which lets others run their jobs before yours is restarted.

You can manually force a checkpoint and restart by using qmod -sj to suspend the job.

Choosing nodes[edit]

Some clusters are non-homogeneous and have nodes with special significance (i.e., newer, upgraded memory, special hardware, etc.). You can request that SGE pick nodes from a specific set of nodes by specifying them with the -q option. Note that this must be quoted if used from the command line. Examples:

by rack
*@compute-1-*
by group
*@@group1

You can get a list of groups with the command

qconf -shgrpl

or the membership of a particular group with qconf -shgrp @group1. For example, use -q '*@@group1' on the command line, or in the batch script:

#$ -q *@@group1

You can also check the resources available on nodes by clicking on Physical View in ganglia.

exclusive node jobs[edit]

If a job needs exclusive access to a node (i.e., parallel matlab jobs), you can request it with

-l exclusive

This will cause the job to run on an empty node by itself. No parallel environment request is needed. Note that you may also need to request a reservation (-R y) to prevent starvation by other non-exclusive jobs.
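
A minimal sketch of such a submission (myjob.sge is a placeholder for your own script):

qsub -cwd -R y -l exclusive myjob.sge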

Environment variables[edit]

SGE defines a few environment variables that may be useful in your sge job scripts.

$JOB_NAME
name assigned to this job (either name of the script, or value of the qsub -N option)
$JOB_ID
a unique number identifying this job
$NSLOTS
number of cpu slots assigned to this job
$NHOSTS
number of nodes assigned to this job (will be LOWER than $NSLOTS if there are multiple cpus per node)
$TMPDIR
directory containing job description scratch files (DO NOT PUT LARGE DATA FILES HERE)
$TMPDIR/machines
(pe's mpi mpich mpich2 only) machine file generated by the parallel environment specific to the version of mpi specified
$PE_HOSTFILE
SGE generated machine file (see man sge_pe under heading $pe_hostfile for complete format description)
one host per line, columns are: hostname #cpus queue-name
$SGE_TASK_ID
(array jobs only) index of the current task within the array job

Additional variables are listed in the qsub man page.
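
A small sketch of a script that just reports these variables, which can be handy for checking what a parallel job was actually given (the mpich request is only an example):

#$ -cwd
#$ -S /bin/sh
#$ -pe mpich 4
echo "job $JOB_NAME ($JOB_ID) was given $NSLOTS slots on $NHOSTS hosts"
# one line per host: hostname, number of slots, queue name
cat $PE_HOSTFILE
# small per-job scratch files can go here; SGE removes it when the job ends
echo "job temporary directory: $TMPDIR"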

Exit codes[edit]

SGE jobs can exit with an error code to tell the queue system what to do next. These are documented in the man page for sge_shepherd.

0
Success (no error)
1
General failure exit
99
Retry this job if allowed
100
Retry job if allowed, but don't enable dependent tasks
other
Other errors
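
As a sketch, a job script can use these codes to ask SGE to requeue work that fails in a way you consider transient (mysolver is a placeholder; submit with -r y so a restart is allowed):

#$ -cwd
#$ -S /bin/sh
./mysolver
if [ $? -ne 0 ]; then
    # ask SGE to put this job back in the queue and try again
    exit 99
fi
exit 0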

Additional notes[edit]

  • Note that qsub runs the job from your home directory by default. Specify the -cwd option to run it from the current directory.
  • Overview pages showing use of Sun Grid Engine in other cluster environments - Includes sample submit scripts
  • SGE seems to force script jobs (which might contain SGE options) to run as csh scripts unless given the -S /bin/sh option, so complicated shell script constructs must use csh syntax or include this option. Alternatively, if your shell script uses sh (Bourne shell) features (or some other interpreter other than csh) and you don't need SGE options read from the script itself, you can use the -b y option to execute the script with the exec system call instead of csh; see the sketch below.
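
For example (a sketch; myscript.sh is a placeholder):

# read #$ options from the script, but interpret its body with the Bourne shell
qsub -cwd -S /bin/sh myscript.sh
# or skip SGE's script handling entirely and execute the file directly (no #$ options are read)
qsub -cwd -b y ./myscript.sh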

Debugging jobs[edit]

  • If your job fails to start (stays in qw state for more than 20 seconds), you can use qstat -j to examine why it is not being started. If the cluster is not full and it stays this way for more than 5 minutes, please contact us.
  • SGE by default saves output from the job in logs that end in the job number. You should examine these logs for clues as to why your job may have aborted. If you want to watch these logs as your job runs, you can use tail -f *#### replacing #### with the job number. (ctrl-c will exit this without disturbing your job.)
  • If there are no errors and your job seems to complete without actually running, check the syntax of your start script. Some editors in Windows will leave the last line of the file incomplete, which will cause it to be ignored. To be sure this does not occur, always add a blank line at the end of your script after editing it in Windows.

Output buffering[edit]

When programs write their output to a file, it is block buffered by default, meaning that no output will appear until a full block of data has been written. If there is a large amount of output, this is significantly more efficient.

If your program outputs data slowly and you want to view the output file as it progresses (with tail -f as above, or some other method), you may want to make it line buffered instead by prefixing the command with stdbuf -o L.

If your script looks like this:

#$ -cwd
./mycode

change it to this:

#$ -cwd
stdbuf -o L ./mycode

Application specific notes[edit]

applications with their own page[edit]

See also:

Ansys CFX[edit]

#$ -l mem_free=3G 
#$ -cwd 
#$ -pe mpich 4
set hosts=`cat $TMPDIR/machines | tr \\010\\012 ,,`
echo $hosts
/share/apps/ansys_inc/v130/CFX/bin/cfx5solve -batch -definition inputfile.def  -parallel -par-dist $hosts -start-method "MPICH Distributed Parallel" 

Other interesting options (Check with CFX documentation):

  • -initial filename.res

Autopartition options:

  • -partition $NSLOTS
  • -parfile-save filename
  • -parfile-read FILENAME

OpenFOAM[edit]

(path is for ariel and euler)

#$ -S /bin/bash
#$ -cwd
#$ -l mem_free=3G
#$ -pe mpich 4 

source /share/apps/OpenFOAM/OpenFOAM-1.7.1/etc/bashrc
decomposePar
mpirun -np $NSLOTS -bynode DATADIR -parallel
reconstructPar -latestTime

quantum espresso[edit]

This doesn't use ATLAS local threads, but MPI does seem to work. Note that more than 4 cpus can be used.

#$ -pe mpich 4
#$ -S /bin/sh

export TMP_DIR PSEUDO_DIR BIN_DIR
ESPRESSO=/share/apps/espresso
BIN_DIR=$ESPRESSO/bin
PSEUDO_DIR=$ESPRESSO/pseudo
TMP_DIR=/state/partition1/g-$USER.$JOB_ID
mkdir -p $TMP_DIR
mpirun -np $NSLOTS -machinefile $TMPDIR/machines $BIN_DIR/pw.x < $1.in > $1:r.out.$JOB_ID
# maybe run other output processing here?? or save things from TMP_DIR
rm -rf $TMP_DIR

Gaussian[edit]

  • http://wiki.cse.ucdavis.edu/support:hpc:software:gaussian
  • http://www.gaussian.com/g_tech/g_ur/g09help.htm
  • the %chk option causes Gaussian to checkpoint
    • it is (probably) restartable from the checkpoint with no changes to the job script as long as the job script doesn't mess with the checkpoint file
  • there is no way for SGE to trigger a checkpoint or exit, and restarting will probably fail if the job is killed while it is checkpointing (so stopping gaussian is dangerous)
  • restarting is untested
  • Gaussian doesn't support MPI hostfiles apparently, so you must force it to not be split across hosts with
#$ -pe mpich-one 4
  • Suggested (but untested) start of a Gaussian job script using scratch files is probably something like
scratch=/state/partition1/g-$USER.$JOB_ID
mkdir -p $scratch
g03 <<EOF
%RWF=$scratch
%NOSAVE
...
EOF
rmdir $scratch

The %NOSAVE option should cause Gaussian to delete files mentioned before %NOSAVE on successful job completion.

  • Note: saving scratch files to your home directory is a very heavy burden on the system and practically eliminates the advantages of a parallel cluster. Jobs caught doing this will most likely be killed. If you need lots of scratch space, use the -l scratch=40G option (or similar) to request nodes with a specific amount of space. If you need more than this, ask for help.

Python scripts[edit]

Please note that you can't run Python scripts directly. If you try, SGE will try to interpret the Python script using csh. Instead, create a separate SGE script to run Python. For example, to run a Python script as an array job, use these two files:

  • test.sge
#$ -cwd
#$ -t 2-5 
module load opt-python
python test.py $SGE_TASK_ID
  • test.py
#!/usr/bin/python 
import sys
print "taskid=" + sys.argv[1];

GPU CUDA applications[edit]

  • use -l gpu=1 to request one gpu for your job
  • load the cuda environment in your script
  • Note that some applications may need additional modules; cudnn might be included with cuda

See also pages for individual software packages, such as those in Category:Deep learning software.

gpu.sge:

#$ -cwd
#$ -l gpu=1
module load cuda opencv caffe opt-python theano-force-gpu
caffe.bin train --solver=solver.prototxt


Note: If you use theano, you might want to add this near the top of your script to force theano to use the gpu and use python 2.7:

module load opt-python theano-force-gpu

Graphical environment in batch mode[edit]

Some jobs need non-interactive graphical interface access to function correctly. For example, STAR CCM+ and Fluent both want access to the screen to generate images and movies from their simulations.

Add the following near the top of your batch job to create a virtual graphical environment:

eval `vfbstartx-csh`

(Note: if you are not using copy/paste, those are back ticks above.)

This works by starting Xvfb to create a virtual graphical terminal and initializing the current environment to use it. Xvfb should shut down and exit after the first graphical command uses it.

If you need this and it does not work, let us know. It is possible that vfbstartx-csh or Xvfb is not installed on some clusters.