RuntimeError: CUDA out of memory
I’m training Donut for document information extraction on a custom dataset of 100 training and 20 validation images. This is the config I used:
```yaml
resume_from_checkpoint_path: null
result_path: "./result"
pretrained_model_name_or_path: "naver-clova-ix/donut-base"
dataset_name_or_paths: ["/content/drive/MyDrive/donut_1.1"] # should be prepared from https://rrc.cvc.uab.es/?ch=17
sort_json_key: True
train_batch_sizes:
val_batch_sizes:
input_size: [2560, 1920]
max_length: 128
align_long_axis: False
# num_nodes: 8
num_nodes: 1
seed: 2022
lr: 3e-5
warmup_steps: 10000
num_training_samples_per_epoch: 39463
max_epochs: 300
max_steps: -1
num_workers: 8
val_check_interval: 1.0
check_val_every_n_epoch: 10
gradient_clip_val: 0.25
verbose: True
```
I’m getting this error:
```
RuntimeError: CUDA out of memory. Tried to allocate 76.00 MiB (GPU 0; 14.76 GiB total capacity; 13.48 GiB already allocated; 6.75 MiB free; 13.58 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
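The error text itself suggests trying `max_split_size_mb` to reduce allocator fragmentation. A minimal sketch of setting that option via the `PYTORCH_CUDA_ALLOC_CONF` environment variable (the value 128 is an illustrative guess, not a tuned setting):

```python
import os

# PYTORCH_CUDA_ALLOC_CONF is read by PyTorch's caching allocator; the
# simplest way to be sure it takes effect is to set it at the very top of
# the training script, before `import torch` and any CUDA allocation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"  # 128 is a guess

print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```

Note that fragmentation is only worth chasing when reserved memory is far above allocated memory; here 13.58 GiB reserved vs. 13.48 GiB allocated suggests the model genuinely needs more memory than the 14.76 GiB card has.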
I tried clearing the torch cache with `torch.cuda.empty_cache()`; it didn’t help. Reducing the batch size didn’t help either.
I also tried a smaller dataset (50 train, 10 validation images), half the size of the earlier one, but the failed allocation is the same: 76.00 MiB.
Is there any way that I can solve this issue? Please help!
- Created 4 months ago
Top GitHub Comments
Thanks! Reducing the input_size from [2560, 1920] to [1920, 1280] helped.
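That tracks with a rough back-of-envelope: the encoder’s activation memory grows roughly with the number of input pixels, so this particular change halves the dominant memory term. Illustrative arithmetic only, not exact profiling:

```python
def pixel_ratio(old_hw, new_hw):
    """Ratio of input pixel counts; a crude proxy for encoder activation memory."""
    return (new_hw[0] * new_hw[1]) / (old_hw[0] * old_hw[1])

# [2560, 1920] -> [1920, 1280] halves the pixel count:
print(pixel_ratio([2560, 1920], [1920, 1280]))  # 0.5
```

This also explains why shrinking the dataset or the batch size didn’t move the numbers: the per-image activation footprint at a fixed input_size dominates, and it is identical for 100 images and for 50.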
Is there a way to decrease GPU memory consumption further? I want to fine-tune it on an 8 GB GPU.
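For an 8 GB card, the usual levers beyond shrinking input_size further are mixed precision and gradient checkpointing. A rough, illustrative estimate of what half precision alone buys on the weights (the ~200M parameter count used below is an assumption for donut-base, not a measured figure):

```python
def param_gib(n_params, bytes_per_param):
    """Memory for model weights alone, in GiB (ignores activations and optimizer state)."""
    return n_params * bytes_per_param / 1024**3

n = 200_000_000  # hypothetical parameter count; check the actual checkpoint
fp32_gib = param_gib(n, 4)
fp16_gib = param_gib(n, 2)
print(round(fp32_gib, 2), round(fp16_gib, 2))  # fp16 halves the weight memory
```

Keep in mind that Adam-style optimizers keep additional per-parameter state (two moment buffers), so optimizer memory is often larger than the weights themselves; gradient checkpointing then attacks the remaining activation term at the cost of extra compute.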