Help:Crcv

Home page: http://crcv.ucf.edu/
Status: http://crcv.eecs.ucf.edu/
Research focus: Research in Computer Vision
Operating System: Ubuntu

Guides[edit]

  • This is a cluster running Ubuntu Linux
  • Please use the help:slurm batch queue system to run your jobs on the cluster and share resources fairly with other users.
  • Access to this cluster is through ssh, which is native to macOS and Linux and has Windows clients available
  • You can see the load and status of this cluster with help:ganglia and the GPU heatmap
  • This cluster currently has several gpu nodes. Use the slurm options section below to check the number and models of gpus and the correct options to request gpus for jobs.
  • Let us know if you need additional software that is not listed here (see also the list at the bottom of the nvidia page).
  • Read the directions below to access software available through modules
  • See also Data antipatterns and optimizations

Datasets[edit]

Many datasets are already on the crcv cluster. Most datasets are in /share/datasets/

  • Before you download a dataset, please check to see if it is already downloaded. Check with others if you can't find what you want.
  • Do not copy datasets, instead use them where they are. Copying is slow, resource intensive, and unnecessary.
  • Many datasets are optimized for fast access. Copying the dataset removes the optimization.
  • Datasets are large. Duplicating the dataset wastes space. If you must, use symbolic links instead.
  • Shared datasets share cache space, speeding up the whole system.
  • If you do download a dataset, don't unpack it. Either use libraries or fuse modules to read the zip or tar directly, or convert it directly to a squashfs without unpacking it (ask for help; see the sketch after this list).
  • If you generate a dataset, try to keep the contents at each directory level between 100 and 1000 entries. Access to a directory with a very large number of items becomes dramatically slower.
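A sketch of the "use the archive in place" approach: FUSE tools such as fuse-zip and squashfuse can mount an archive read-only under your home directory instead of unpacking it. The paths below are placeholders, and whether these tools are installed on the cluster is an assumption; ask for help if they are missing.

 mkdir -p ~/mnt/mydata
 fuse-zip -r /share/datasets/mydata.zip ~/mnt/mydata   # read-only mount of the zip
 # ... run your job reading from ~/mnt/mydata ...
 fusermount -u ~/mnt/mydata                            # unmount when done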

example slurm script to use gpu[edit]

Create a job script and save it, for example as myjob.slurm:

#!/bin/bash
# request one gpu in the gpu partition and 4 cpu cores
#SBATCH -p gpu --gres=gpu:1
#SBATCH -c 4
# show which gpu(s) slurm assigned, then run the training command (caffe in this example)
echo CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES
caffe.bin train --solver=solver.prototxt

Submit the job with:

 module load cuda cudnn
 sbatch myjob.slurm

Note: you can also request a specific gpu type by selecting the right gres, for example (pick one and replace the --gres line above):

#SBATCH -p gpu --gres=gpu:pascal:1
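
To double-check which gpu model a given request maps to, you can run nvidia-smi interactively on an allocated node (nvidia-smi is the standard NVIDIA utility; requesting a pascal here is just an illustration):

 srun -p gpu --gres=gpu:pascal:1 nvidia-smi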

There are two gpu partitions:

-p short
includes pascal and other gpus with less than 12 GB of memory. Maximum job length is 1 day; the default is 3 hours. Suitable for short jobs and interactive debugging sessions.
-p gpu
includes higher-end gpus that are in high demand

If neither of these options is given, the job will default to the CPU partition. If your job doesn't need a gpu, please use the cpu partition, and please run long-running, memory- and cpu-intensive jobs there rather than on the login node. (srun -c6 --pty may be appropriate, especially for conda installs; see the example below.)
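Because jobs that don't request a gpu land in the CPU partition by default, a minimal interactive session there might look like this (the memory size is only an illustration):

 srun -c6 --mem=16G --pty bash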

slurm options[edit]

These slurm options are specific to the CRCV cluster.

  • Get a complete list of current features available with
 sinfo -O features:

(Note: trailing colon is needed!) More detailed information:

 sinfo -e --Format "Nodelist:.15,Features:.50,Gres:.30,Memory:.15,CPUs:.15"

These options allow you to pick nodes by specific features available:

--gpus=1 -C 'pascal|turing|volta'
select any gpu except ampere
--gres=gpu:pascal:1
select a pascal gpu
--gres=gpu:volta:1
select a volta gpu
--gres=gpu:turing:1
select a turing gpu
-C infiniband
select a node with a high-speed 100 Gb/s InfiniBand network

GPU generations from oldest (slowest) to newest currently available are: pascal volta turing ampere

You can also pick gpus by memory size using -C with feature names gmem11 gmem12 gmem16 gmem24 gmem32 gmemT48 gmem48 gmem80
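For example, to request any single gpu with at least 24 GB of memory, the gmem feature names can be combined in a constraint (which sizes to list depends on what you are willing to accept):

 sbatch --gpus=1 -C 'gmem24|gmem32|gmem48|gmem80' myjob.slurm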

--qos=gtest
Allows up to 8 gpus of any type, but job runtime is limited to an hour
-p preempt --qos preempt
No limits on number or type of gpus when they are idle, but your job will be killed to make room for other users when necessary
-p preempt --qos preempt --requeue --open-mode=append
As above, but restart the job when resources become available again (see the example below).
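For example, a preemptible job that automatically requeues when it is killed could be submitted like this (myjob.slurm is the script from the earlier example):

 sbatch -p preempt --qos preempt --requeue --open-mode=append myjob.slurm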

The following options are available by request only after demonstrating effectiveness with gtest:

--qos medg
Use up to 4 gpus at once
--qos highg
Use up to 8 gpus at once

obsolete and automatic options[edit]

-p gpu
without this option, an appropriate gpu partition is now selected automatically when you request a gpu

RTX issues[edit]

If your job works on pascal gpus but fails on the newer RTX gpus, your code may be having memory segmentation issues. Try this:

# let TensorFlow allocate gpu memory on demand instead of grabbing it all at startup
from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession

config = ConfigProto()
config.gpu_options.allow_growth = True
session = InteractiveSession(config=config)

Old CUDA versions[edit]

If you get this error:

CUDA error: no kernel image is available for execution on the device

it indicates that you are using a version of cuda too old for the gpu hardware allocated to your job. There are exactly two solutions to this:

  1. Upgrade your code to use the latest version of cuda
  2. Ask slurm to give you an older gpu that supports the version of cuda you are using. (For example, sbatch -C '!ampere' )

Please note that the first option is preferred, as eventually the old version of cuda will become unavailable and you will be forced to upgrade anyway.

 cuda   last release    issues
 9      May 2018        won't compile; won't run on turing or ampere
 10     Nov 2019        won't run on ampere gpus
 11     October 2022    still supported (11.0 released March 2020)
 12     current         (released December 2022)

Additional software[edit]

These environments can be loaded with the module command.

Use module load XXX to enable software listed below:

 module      software                                versions available   dependencies
 cuda        nVidia cuda library                     various
 cudnn       nVidia deep neural net library          various              cuda
 matlab      (ask if you need a different version)   2018a
 torch       help:torch                                                   cuda
 tensorflow  This module just loads cuda and cudnn   1.7 (anaconda)       cuda cudnn
 anaconda3   alternate python version
 pytorch                                                                  anaconda
 julia       julia                                   *
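For example, to load the cuda and cudnn modules and then check what is loaded and what else is available (module list and module avail are standard commands of the module system):

 module load cuda cudnn
 module list     # show currently loaded modules
 module avail    # list all available modules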

Reserved gpus[edit]

When the cluster is full, complex priority rules take effect that try to balance use among all users.

To assist with this balance when the cluster is completely full, a number of gpus are reserved for short jobs and for users who don't currently have jobs running. The options described in detail in the next section can take advantage of these reservations.

The --qos short option allows use of reserved gpus for 3 to 5 hours. However, only one gpu of each type is reserved, so you may not get the reservation if you over-constrain the job and are too picky about which gpu you get.

The -p short --qos short option gives access to a pool of pascal gpus reserved for short and interactive jobs and should almost always be available.

If a longer job is desired, each user can have one use of -p short --qos shortday to get a reserved pascal gpu for up to one day.
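For example, such a one-day job on a reserved pascal gpu might be submitted like this (the explicit --gres request and the script name are illustrative):

 sbatch -p short --qos shortday --gres=gpu:1 myjob.slurm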

GPU usage limits[edit]

If you get a message like this:

Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)

then you have tried to submit a job that exceeds either your account limits or fair use limits on the cluster.

To fairly distribute gpus between users, GPU use is by default limited by a gpu quota and a single node per job. If you need more, the following options are available. Please note, the --qos= option must be one of the first options.

-p preempt --qos preempt
No limits on number or type of gpus when they are idle, but your job will be killed to make room for other users when necessary
-p preempt --qos preempt --requeue --open-mode=append
As above, but restart the job when resources become available again. Note, this only works with sbatch. (Note: this may be buggy. Ask for help if it doesn't work for you and we can apply workarounds.)
--qos short
Limit runtime to 5h (preemptive after 3h) but allow a larger number of interactive and batch jobs (and possibly use the immediate job reservation)
-p short
select gpus that are dedicated to short jobs, making it more likely to run sooner; this defaults to --qos short
-p short --qos shortday
only one job per user may use this; one gpu, max one day runtime, but higher chance of starting sooner
--qos day
(Available to members of group1 only) This gives a slightly higher gpu quota than the default with a slightly higher priority, but the job becomes preemptive after 24 hours of runtime.
--time 3:00:00
Limit runtime to 3 hours. Jobs shorter than 5h can use a limited time reservation to start immediately if resources are available (see the example below)
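For example, a short job intended to qualify for the immediate reservation might be submitted like this (resource choices are illustrative; note that --qos= should come early among the options):

 sbatch --qos short --time 3:00:00 --gres=gpu:1 myjob.slurm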


Note that only gpus with 16 or 32 GB are in the immediate short job reservation. Any request for a specific gpu model or memory size may make the job ineligible for the immediate job reservation.


Note that the --requeue option assumes your job checkpoints internally and knows how to restart itself where it left off. If you need to start the job specially to restart, check $SLURM_RESTART_COUNT in your sbatch script. Ask us if you would like help getting checkpoint / restart working.
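A minimal sketch of such a check inside an sbatch script, assuming your program writes its own checkpoints (train.py, the --resume flag, and the checkpoint path are placeholders):

 if [ "${SLURM_RESTART_COUNT:-0}" -gt 0 ]; then
     echo "restart number $SLURM_RESTART_COUNT, resuming from checkpoint"
     python train.py --resume checkpoints/latest.ckpt   # hypothetical resume invocation
 else
     python train.py                                    # hypothetical first run
 fi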

Interactive jobs (using srun) are limited. You may be able to get access to more gpus in your allocation by using sbatch instead.

Jobs are limited to a single node because special libraries are required to allow a job to take advantage of multiple nodes. If you believe your job can take advantage of multiple nodes, let us know and submit the job as preemptive. If you can show this works, multiple nodes per job can be enabled for your account. However, frequently there may not be enough free gpus to take advantage of this anyway.