Help:Crcv
- Home page: http://crcv.ucf.edu/
- Status: http://crcv.eecs.ucf.edu/
- Research focus: Research in Computer Vision
- Operating system: Ubuntu
Guides
- This is a cluster running Ubuntu Linux
- Please use the help:slurm batch queue system to run your jobs on the cluster and share resources fairly with other users.
- Access to this cluster is through ssh, which is native to macOS and Linux and has Windows clients available
- You can see the load and status of this cluster with help:ganglia and the GPU heatmap
- This cluster currently has several gpu nodes. Use the slurm options below to check the number and models available and the correct options to request gpus for jobs.
- Let us know if you need additional software not listed here (see also the list at the bottom of nvidia).
- Read the directions below to access software available through modules
- See also Data antipatterns and optimizations
Datasets
Many datasets are already on the crcv cluster. Most datasets are in /share/datasets/
- Before you download a dataset, please check to see if it is already downloaded. Check with others if you can't find what you want.
- Do not copy datasets; use them where they are. Copying is slow, resource-intensive, and unnecessary.
- Many datasets are optimized for fast access. Copying the dataset removes the optimization.
- Datasets are large. Duplicating the dataset wastes space. If you must, use symbolic links instead.
- Shared datasets share cache space, speeding up the whole system.
- If you do download a dataset, don't unpack it. Either use libraries or fuse modules to read the zip or tar directly, or convert it directly to a squashfs without unpacking it (ask for help).
- If you generate a dataset, try to keep the contents at each directory level between 100 and 1000 entries. Directories with very large numbers of entries become dramatically slower to access.
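The advice above about reading archives in place can be sketched with Python's standard zipfile module. This is a minimal illustration using a small in-memory archive; for large datasets under /share/datasets/ the fuse or squashfs approaches mentioned above may be more appropriate.

```python
import io
import zipfile

# Build a tiny in-memory zip standing in for a dataset archive
# (a real archive would live under /share/datasets/).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("images/0001.txt", "sample one")
    zf.writestr("images/0002.txt", "sample two")

# Read members directly from the archive -- no unpacking needed.
with zipfile.ZipFile(buf) as zf:
    names = zf.namelist()
    data = zf.read("images/0001.txt").decode()

print(names)  # ['images/0001.txt', 'images/0002.txt']
print(data)   # sample one
```

The same pattern works with a path to an on-disk archive in place of the in-memory buffer, so individual samples can be read during training without ever extracting the archive.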
example slurm script to use gpu
Create a startup script and save it, for example as myjob.slurm:
#!/bin/bash
#SBATCH -p gpu --gres=gpu:1
#SBATCH -c 4
echo CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES
caffe.bin train --solver=solver.prototxt
Submit the job with:
module load cuda cudnn
sbatch myjob.slurm
Note: you can also specify gpu type by selecting the right gres, for example: (pick one and replace the line above)
#SBATCH -p gpu --gres=gpu:pascal:1
There are two gpu partitions:
- -p short
- includes pascals and other <12G gpus. Max job length is 1 day; defaults to 3 hours. Suitable for short jobs and interactive debugging sessions.
- -p gpu
- includes higher end gpus that are in high demand
If neither of these options is given, the job will default to the CPU partition. If your job doesn't need a gpu, please use the cpu partition. Please run long-running, memory- and cpu-intensive jobs in the cpu partition rather than on the login node. (srun -c6 --pty may be appropriate, especially for conda install.)
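As a concrete sketch, an interactive debugging session on the short partition might be requested like this (the core count and gpu request are illustrative values, not requirements):

```shell
# One gpu on the short partition, 4 cpu cores, and an interactive
# shell on the allocated node. Flags combine options described above.
srun -p short --qos short --gres=gpu:1 -c 4 --pty bash
```

Exit the shell when done so the gpu returns to the pool.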
slurm options
These slurm options are specific to the CRCV cluster.
- See also GPU usage limits below.
- See also generic slurm options
- Get a complete list of currently available features with:
sinfo -O features:
(Note: the trailing colon is needed!) For more detailed information:
sinfo -e --Format "Nodelist:.15,Features:.50,Gres:.30,Memory:.15,CPUs:.15"
These options allow you to pick nodes by specific features available:
- --gpus=1 -C 'pascal|turing|volta'
- select any gpu except ampere
- --gres=gpu:pascal:1
- select pascal gpu
- --gres=gpu:volta:1
- select volta gpu
- --gres=gpu:turing:1
- select turing gpu
- -C infiniband
- select a node with a high-speed 100Gb/s InfiniBand network
GPU generations from oldest (slowest) to newest currently available are: pascal volta turing ampere
You can also pick gpus by memory size using -C with feature names gmem11 gmem12 gmem16 gmem24 gmem32 gmemT48 gmem48 gmem80
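For example, the memory-size features above might be combined in a constraint like this (myjob.slurm is a placeholder script name, and the particular feature set chosen is illustrative):

```shell
# Request any one gpu with >=24G of memory, using the gmem
# feature names listed above.
sbatch --gpus=1 -C 'gmem24|gmem32|gmem48|gmem80' myjob.slurm
```

Remember that over-constraining a job narrows the set of nodes it can run on and may delay scheduling.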
- --qos=gtest
- Allows up to 8 gpus of any type, but job runtime is limited to an hour
- -p preempt --qos preempt
- No limits on number or type of gpus when they are idle, but your job will be killed to make room for other users when necessary
- -p preempt --qos preempt --requeue --open-mode=append
- As above, but restart the job when resources become available again.
The following options are available by request only after demonstrating effectiveness with gtest:
- --qos medg
- Use up to 4 gpus at once
- --qos highg
- Use up to 8 gpus at once
obsolete and automatic options
- -p gpu
- without this option, an appropriate gpu partition is now selected automatically when you request a gpu
RTX issues
If your job works on pascal gpus but not the newer RTX gpus, your code may be having memory segmentation issues. Try this:
from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession

config = ConfigProto()
config.gpu_options.allow_growth = True
session = InteractiveSession(config=config)
Old CUDA versions
If you get this error:
CUDA error: no kernel image is available for execution on the device
it indicates that you are using a version of cuda too old for the gpu hardware allocated to your job. There are exactly two solutions to this:
- Upgrade your code to use the latest version of cuda
- Ask slurm to give you an older gpu that supports the version of cuda you are using (for example, sbatch -C '!ampere')
Please note that the first option is preferred, as eventually the old version of cuda will become unavailable and you will be forced to upgrade anyway.
cuda version | last release | issues |
---|---|---|
9 | May 2018 | won't compile; won't run on turing or ampere |
10 | Nov 2019 | won't run on ampere gpus |
11 | Oct 2022 | still supported (11.0 released March 2020) |
12 | current | (12.0 released December 2022) |
Additional software
These environments can be loaded with the module command.
Use module load XXX to enable software listed below:
module | software | versions available | dependencies |
---|---|---|---|
cuda | nVidia cuda library | various | |
cudnn | nVidia deep neural net library | various | cuda |
matlab | (ask if you need a different version) | 2018a | |
torch | help:torch | | cuda |
tensorflow | This module just loads cuda and cudnn | 1.7 | (anaconda) cuda cudnn |
anaconda3 | alternate python version | | |
pytorch | | | anaconda |
julia | julia | * | |
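As a usage sketch, a module chain from the table above might be loaded like this before running your code (train.py is a placeholder for your own script):

```shell
# Load tensorflow's dependencies per the table above, then run.
module load anaconda3 cuda cudnn tensorflow
python train.py   # placeholder for your own script
```

Module loads belong in your sbatch script so batch jobs get the same environment as your interactive session.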
Reserved gpus
When the cluster is full, complex priority rules become effective that try to balance use between all users.
To assist with this balance when the cluster is completely full, a number of gpus are reserved for short jobs and users who don't currently have jobs running. Options (described in detail in the next section) can take advantage of these reservations.
The --qos short option allows use of reserved gpus for 3-5 hours. However, only one gpu of each type is reserved, so the reservation may not be available if you over-constrain the job and are too picky about which gpu you get.
The -p short --qos short option gives access to a pool of pascal gpus reserved for short and interactive jobs and should almost always be available.
If a longer job is desired, each user can have one use of -p short --qos shortday to get a reserved pascal gpu for up to one day.
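For example, the reserved-gpu options above might be used like this (myjob.slurm is a placeholder script name):

```shell
# A short interactive session on the reserved pascal pool:
srun -p short --qos short --pty bash

# A day-long job on a reserved pascal gpu (one such job per user):
sbatch -p short --qos shortday myjob.slurm
```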
GPU usage limits
If you get a message like this:
Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
then you have tried to submit a job that exceeds either your account limits or fair use limits on the cluster.
To fairly distribute gpus between users, GPU use is by default limited by a gpu quota and a single node per job. If you need more, the following options are available. Please note, the --qos= option must be one of the first options.
- -p preempt --qos preempt
- No limits on number or type of gpus when they are idle, but your job will be killed to make room for other users when necessary
- -p preempt --qos preempt --requeue --open-mode=append
- As above, but restart the job when resources become available again. This only works with sbatch. (Note: this may be buggy; ask for help if it doesn't work for you and we can apply workarounds.)
- --qos short
- Limit runtime to 5h (preemptive after 3h) but allow larger number of interactive and batch jobs (and possibly use the immediate job reservation)
- -p short
- select gpus that are dedicated to short jobs, making the job more likely to start sooner. Defaults to --qos short
- -p short --qos shortday
- only one job per user may use this; one gpu, max one day runtime, but higher chance of starting sooner
- --qos day
- (Available to members of group1 only) This gives a slightly higher gpu quota than the default with a slightly higher priority, but the job becomes preemptive after 24 hours of runtime.
- --time 3:00:00
- Limit runtime to 3 hours. Jobs shorter than 5h can use a limited-time reservation to start immediately if resources are available
Note that only gpus with 16 or 32G are in the immediate short job reservation. Any request for a specific gpu model or memory size may cause the job to not be eligible for the immediate job reservation.
Note that the --requeue option assumes your job checkpoints internally and knows how to restart itself where it left off.
If you need to start the job specially to restart, check $SLURM_RESTART_COUNT in your sbatch script. Ask us if you would like help getting checkpoint / restart working.
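A minimal sketch of a requeue-aware batch script, assuming a training program (train.py and its --resume flag are hypothetical) that can resume from its own checkpoints:

```shell
#!/bin/bash
#SBATCH -p preempt --qos preempt
#SBATCH --requeue --open-mode=append
#SBATCH --gres=gpu:1

# SLURM_RESTART_COUNT is unset on the first run and increments each
# time slurm requeues the job after preemption.
if [ "${SLURM_RESTART_COUNT:-0}" -gt 0 ]; then
    echo "Restart #${SLURM_RESTART_COUNT}: resuming from latest checkpoint"
    python train.py --resume checkpoints/latest.pt   # hypothetical resume flag
else
    echo "First run: starting from scratch"
    python train.py
fi
```

The --open-mode=append option keeps output from earlier runs in the same log file instead of overwriting it on each restart.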
Interactive jobs (using srun) are limited. You may be able to get access to more gpus in your allocation by using sbatch instead.
Jobs are limited to a single node because special libraries are required to allow a job to take advantage of multiple nodes. If you believe your job can take advantage of multiple nodes, let us know and submit the job as preemptive. If you can show this works, multiple nodes per job can be enabled for your account. However, frequently there may not be enough free gpus to take advantage of this anyway.