Data antipatterns and optimizations

Deep learning training sets, and vision datasets in particular, have characteristics that fall at the extreme edge of what modern storage systems are designed to handle. This can cause severe performance issues both on the cluster and on local desktops.

The following guidelines should help mitigate these issues and improve performance of both your jobs and of other jobs on the cluster or on your local workstation.

These are guidelines only. If you are aware of other similar libraries that might help, they should be added here.

Problem

Modern storage systems are designed to handle small numbers of large files. Large numbers of small files perform poorly on both disks and network shared storage, and even impact SSDs.

For the purposes of this discussion, a "large" dataset is one with more than 10,000 files OR larger than 200 GB.

Antipatterns

  • Don't unpack videos if you can use the video files directly. Libraries such as decord or NVIDIA DALI (see below) can decode frames straight from the video container.
  • If your dataset comes as a tar file or similar archive, don't unpack it; use a library such as webdataset to read it directly (see the sketch after this list).
  • If you can't use decord, webdataset, or a similar library and must unpack your images, ask for help to repack them as a squashfs. (It is faster to convert the zip or tar directly to a squashfs!) SquashFS is read only, so finalize the dataset and make sure it is complete before getting it squashed.
  • Don't use more than 12 threads/cores per GPU
    • The current cluster configuration allocates 12 CPU cores/threads per GPU. If you use more than that, you starve other GPUs of CPUs and prevent their use.
    • Using an excessive number of threads causes I/O contention between your own threads and will slow down your job.
  • Don't use many more threads than cores
    • If all of your threads have high CPU use and you have more than one thread per CPU core, all of the threads will run slower. Don't overprovision threads unless you are sure that the threads beyond the number of CPU cores will be idle.
    • Request fewer workers than allocated CPU threads, because frameworks like PyTorch start a management thread that may use significant CPU in addition to your worker threads (see the DataLoader sketch after this list).
  • Don't just start multiple threads; make sure work gets distributed between them
    • It doesn't matter how many threads you start if only one of them does any work. If the work isn't distributed, only one cpu core will be used.
  • Don't download on the login node. This slows down the login node for everyone, and slows down the storage node as well.
  • Don't run conda install on the login node. Use srun for this.
  • Don't use srun for long running jobs.
    • If your login session is interrupted, your job will get killed. Use sbatch instead.
    • If the login node is rotated (for clusters with multiple login nodes), you may lose access to your screen sessions. You can get them back by accessing the old login node directly, but a node that has been rotated out is about to be rebooted, so don't start new sessions on it.
    • Printing large amounts of debug output can slow down your job. Doing this under srun makes it worse, as your job may have to wait for each message to be passed to your terminal before continuing.
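
As a concrete illustration of reading a dataset without unpacking it, the sketch below uses webdataset to stream image/label pairs straight out of a tar file and decord to pull frames straight out of an mp4. The paths and the field names ("jpg", "cls") are placeholders that assume the usual webdataset key.extension layout inside the tar; adapt them to your dataset.

```python
# Minimal sketch: read samples straight from a tar file and frames straight
# from a video, without unpacking either. Paths and field names are
# placeholders; adjust them to your dataset's layout.
import webdataset as wds          # pip install webdataset
from decord import VideoReader    # pip install decord

# webdataset streams samples out of the tar; files named <key>.jpg and
# <key>.cls inside the tar become the fields requested by to_tuple().
dataset = (
    wds.WebDataset("/share/datasets/example/shard-000000.tar")  # hypothetical path
    .decode("pil")             # decode image bytes into PIL images
    .to_tuple("jpg", "cls")
)
for image, label in dataset:
    break  # hand the samples to your training loop / DataLoader

# decord decodes frames directly from the video container, so the mp4 never
# has to be unpacked into thousands of individual frame images.
vr = VideoReader("/share/datasets/example/clip.mp4")  # hypothetical path
frames = vr.get_batch(range(0, len(vr), 30))          # e.g. every 30th frame
```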
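
The thread and worker guidance above can be wired directly into a PyTorch DataLoader. This is a minimal sketch for a job submitted through Slurm; the TensorDataset is a stand-in for your real dataset, and the fallback of 4 CPUs is an assumption, not a cluster setting.

```python
# Minimal sketch: size DataLoader workers from the Slurm CPU allocation so the
# job never runs more busy threads than it was given.
import os
import torch
from torch.utils.data import DataLoader, TensorDataset

# SLURM_CPUS_PER_TASK reflects the CPUs actually allocated to this job
# (typically 12 per GPU here); fall back to 4 if it is unset.
cpus = int(os.environ.get("SLURM_CPUS_PER_TASK", 4))

# Keep one CPU free for the management thread PyTorch runs alongside the
# workers, and stop intra-op threading from multiplying the thread count.
torch.set_num_threads(1)
num_workers = max(1, cpus - 1)

dataset = TensorDataset(torch.randn(1024, 3, 224, 224),
                        torch.zeros(1024, dtype=torch.long))
loader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=num_workers,  # parallel loading keeps the GPU fed
    pin_memory=True,          # faster host-to-GPU copies
    prefetch_factor=2,        # each worker keeps two batches ready ahead of use
)
```

Benchmark a few values of num_workers (the 4-12 range suggested below) rather than assuming more is always better.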

Optimizations

  • Use at least 4 threads per GPU, but benchmark both more (up to 12) and fewer to find the optimal number for your code and data loader.
  • Use a good data loader that can load images in parallel just ahead of use to keep the GPU busy (as in the DataLoader sketch above).
    • Make sure your data loader runs in multiple threads, and those threads on multiple cores; otherwise you get no performance increase from it.
    • Use a data loader like NVIDIA DALI (see below) that can load directly from mp4 and jpg files, decode on the GPU, and do part of the preprocessing on the GPU (measured 15x - 40x speedup!).
  • If you have low or zero cpu use jobs with high I/O (such as unzip and some downloaders), you can run them directly on the storage node. (on crcv, this is crcv2.eecs.ucf.edu or c3-0 internally)
    • High-memory, high-CPU, or long-running jobs should never be run on the storage node.
  • Use infiniband nodes ( -C infiniband ) for high I/O jobs that don't run on the storage node.
  • A shared conda venv might help, especially if it can be squashed.
    • I am open to suggestions about what should be included in such a venv.
  • The GPUs include hardware that can directly decode mpeg video and jpeg images. If you can decode and use the result directly on the GPU (rather than moving it back to the CPU first), this can help greatly. A GPU-aware build of OpenCV has been compiled on the cluster and can do this.
  • It is unclear whether webdataset's use of tar files or squashed datasets is faster, but squashfs is probably faster for repeated reuse.
  • The system caches heavily used datasets, so whenever possible, share datasets rather than duplicating them!
    • If you download a dataset to your home directory and would like to share it in /share/datasets, ask for help moving it (this takes seconds if done right).
    • Datasets in /share/datasets may be squashed to improve performance
  • If you have a large dataset that is not being modified and would like it squashed, let us know.
    • Large datasets in home directories may also be squashed without warning.
  • Newer NVIDIA cards support half precision (16-bit floats) and lower, which can reduce GPU memory use and increase performance at some cost in precision (see the autocast call in the sketch after this list).
  • Some data formats are better than others.
    • json is maximally portable but very large and inefficient
    • pickle is binary and better than json, but not portable
    • torch tensors are domain specific, can be loaded directly to the GPU, and are significantly faster to load than pickle or numpy formats (see the sketch after this list).
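
The last two bullets can be combined in practice: store preprocessed data as torch tensor files, load them straight onto the GPU with map_location, and run the forward pass in half precision under autocast. This is a minimal sketch; the file name and the tiny linear model are placeholders.

```python
# Minimal sketch: torch-native serialization plus half-precision compute.
# The file name and model are placeholders.
import torch
import torch.nn as nn

# Save preprocessed features once as a tensor file instead of json or pickle...
features = torch.randn(10_000, 512)
torch.save(features, "features.pt")

# ...then load them straight onto the GPU, with no CPU-side json/pickle decoding.
device = "cuda" if torch.cuda.is_available() else "cpu"
features = torch.load("features.pt", map_location=device)

model = nn.Linear(512, 10).to(device)

if device == "cuda":
    # autocast runs the matmuls in float16 on newer NVIDIA cards, cutting GPU
    # memory use and usually increasing throughput at some cost in precision.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        logits = model(features[:64])
else:
    logits = model(features[:64])
```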

External optimization guides

Data transfer

  • Transfer large files and datasets directly to the storage node, not through the login node.
    • The crcv storage node is externally crcv2.eecs.ucf.edu and internally c3-0
  • If you need to download a dataset to your local machine that has already been squashed, download the squash file instead of the individual files. The transfer will be much faster, and on Linux you can mount the squashfs directly and use it instead of unpacking it.
    • On your Linux desktop, squashfuse can mount squashfs images. Alternatively, ask for help configuring them to be automounted.
    • On your Windows desktop, unpack the squashfs with a 7zip fork that supports it.
    • On the crcv cluster, things in /squash/file/ come from /share/ARC/squashfs/file.sq
    • On the crcv cluster, many things in /share/datasets/ are symbolically linked to /squash/
  • Transfer data directly through the storage node (crcv2.eecs.ucf.edu) as much as possible.
  • If pulling from the cloud, rclone may help if it supports your cloud provider; it is installed only on the storage node. Useful download tools include (see the gdown sketch after this list):
    • rclone
    • gdown
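
As an example of pulling a file from Google Drive without a browser, the sketch below uses gdown's Python API. The file ID and output name are placeholders, and as noted above the download should run on the storage node, not the login node.

```python
# Minimal sketch: download a shared Google Drive file with gdown.
# FILE_ID and the output name are placeholders.
import gdown

url = "https://drive.google.com/uc?id=FILE_ID"
gdown.download(url, output="dataset.tar", quiet=False)
```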

Additional reading and sources

These don't solve any problems but are good explanations.