CUDA out of memory

The error 'CUDA out of memory' has several distinct causes, and the online guides you find through Google tend to focus on only one of them, which may not be your particular problem.

This is a summary of known causes and solutions.

How to read this error

RuntimeError: CUDA out of memory. Tried to allocate X GiB (GPU 0; Y GiB total capacity; Z GiB already allocated; F GiB free; R GiB reserved in total by PyTorch)

The values X, Y, Z, F, and R are placeholders for the actual numbers in a real error message.

Only X and F are meaningful for debugging this problem.

X
the size of the chunk your code is trying to allocate
F
the amount of memory remaining to fulfill the allocation request
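
If you want to see these quantities outside of an error message, PyTorch can report them at runtime. The sketch below is a minimal example, assuming a single-GPU job; the device index is a placeholder.

 import torch
 
 dev = torch.device("cuda:0")  # adjust to match the GPU named in the error
 
 free, total = torch.cuda.mem_get_info(dev)    # F and Y, as the driver sees them
 allocated = torch.cuda.memory_allocated(dev)  # Z: memory held by live tensors
 reserved = torch.cuda.memory_reserved(dev)    # R: memory cached by PyTorch's allocator
 
 print(f"total     {total / 2**30:.2f} GiB")
 print(f"free      {free / 2**30:.2f} GiB")
 print(f"allocated {allocated / 2**30:.2f} GiB")
 print(f"reserved  {reserved / 2**30:.2f} GiB")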

Solutions

There are several cases:

X > F
Your code is trying to use more memory than the GPU has available. Try reducing the batch size, reducing overall memory use, looking for memory leaks, releasing values you no longer need, and calling the garbage collector (see the first sketch after this list). This is the most frequent answer found on Google.
X is near F
Your memory may be fragmented. Freeing the cache and calling the garbage collector may help, and export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512 may reduce fragmentation in borderline cases (see the second sketch after this list). This is the second most common answer found on Google.
X << F
There is lots of memory free, but CUDA can't allocate it anyway. This is most likely caused by Slurm constraining memory use because you only requested one CPU, which comes with only 8 GB of RAM, while using a larger GPU. Try requesting more CPU cores (up to 12 per GPU) to get the additional memory that comes with them (see the third sketch after this list).
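
For the X > F case, the usual remedies look roughly like the following. This is a minimal sketch: the model, optimizer, and batch size are placeholders, not code from this wiki.

 import gc
 import torch
 
 model = torch.nn.Linear(1024, 1024).cuda()
 optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
 batch_size = 8  # if X > F, cutting this is usually the first thing to try
 
 for step in range(10):
     batch = torch.randn(batch_size, 1024, device="cuda")
     loss = model(batch).pow(2).mean()
     loss.backward()
     optimizer.step()
     optimizer.zero_grad(set_to_none=True)  # release gradient buffers instead of zeroing them
     del batch, loss  # drop references so the allocator can reuse the memory
 
 gc.collect()              # collect Python objects that still pin CUDA memory through cycles
 torch.cuda.empty_cache()  # hand cached-but-unused blocks back to the driver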
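
For the fragmentation case, the allocator option has to be in place before CUDA is initialised, so exporting it in your job script before Python starts is the safest route. A sketch of setting it from inside Python instead:

 import os
 
 # PyTorch's caching allocator reads this variable when CUDA is first
 # initialised, so set it before the first CUDA call.
 os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"
 
 import torch
 
 x = torch.randn(4, device="cuda")  # the first CUDA use picks up the setting
 torch.cuda.empty_cache()           # freeing the cache is also worth trying on its own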
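
For the X << F case, it can help to confirm that the bottleneck is the host-memory limit rather than the GPU itself. A best-effort sketch; the cgroup paths below are assumptions that vary between clusters and cgroup versions.

 from pathlib import Path
 
 import torch
 
 def cgroup_mem_limit_bytes():
     # Best-effort read of the host-memory limit Slurm imposes through
     # cgroups; tries cgroup v2 first, then v1.
     for p in ("/sys/fs/cgroup/memory.max",
               "/sys/fs/cgroup/memory/memory.limit_in_bytes"):
         f = Path(p)
         if f.exists():
             raw = f.read_text().strip()
             return None if raw == "max" else int(raw)
     return None
 
 gpu_free, gpu_total = torch.cuda.mem_get_info()
 host_limit = cgroup_mem_limit_bytes()
 
 print(f"GPU free:   {gpu_free / 2**30:.1f} of {gpu_total / 2**30:.1f} GiB")
 if host_limit is None:
     print("no cgroup host-memory limit found")
 else:
     print(f"host limit: {host_limit / 2**30:.1f} GiB")

If the reported host limit is far below the GPU's capacity, requesting more CPU cores (and the RAM that comes with them) as described above is the fix.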

External guides