RuntimeError: CUDA out of memory while training VarNet

Hello,

Since mid-2020, I have been using the fastMRI project, modifying the subsampling scripts to accommodate a custom undersampling pattern and comparing the reconstructions to the Varden and equispaced undersampled data. Training and testing the UNet model on the remote GPU server has not been a problem; everything works well (see the images below).

[Images: Figure_gt, Figure_1]

While training the VarNet model, I encounter the following error after 7-9 iterations of the first epoch:

RuntimeError: CUDA out of memory. Tried to allocate 28.00 MiB (GPU 0; 10.76 GiB total capacity; 9.73 GiB already allocated; 11.76 MiB free; 9.89 GiB reserved in total by PyTorch)
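The figures in this message can be compared directly. The following is a minimal sketch in plain Python, not part of fastMRI or PyTorch; the helper name `parse_oom_message` is hypothetical. It converts each quoted size to MiB and computes the gap between reserved and allocated memory, which is memory held by PyTorch's caching allocator but not currently occupied by tensors.

```python
import re

# Multipliers to convert the units that appear in the message into MiB.
_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}


def parse_oom_message(message):
    """Return the sizes quoted in a CUDA out-of-memory message, in MiB.

    Hypothetical helper for illustration; it only understands the
    message layout shown in this issue.
    """
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
        "total": r"([\d.]+) (MiB|GiB) total capacity",
        "allocated": r"([\d.]+) (MiB|GiB) already allocated",
        "free": r"([\d.]+) (MiB|GiB) free",
        "reserved": r"([\d.]+) (MiB|GiB) reserved",
    }
    sizes = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message)
        if match:
            sizes[name] = float(match.group(1)) * _TO_MIB[match.group(2)]
    return sizes


message = (
    "RuntimeError: CUDA out of memory. Tried to allocate 28.00 MiB "
    "(GPU 0; 10.76 GiB total capacity; 9.73 GiB already allocated; "
    "11.76 MiB free; 9.89 GiB reserved in total by PyTorch)"
)

sizes = parse_oom_message(message)
# Memory cached by the allocator but not backing any live tensor.
cached_unused = sizes["reserved"] - sizes["allocated"]
print(f"requested:     {sizes['requested']:.2f} MiB")
print(f"free on GPU:   {sizes['free']:.2f} MiB")
print(f"cached unused: {cached_unused:.2f} MiB")
```

Here the request (28 MiB) exceeds the free memory on the device (11.76 MiB), while roughly 164 MiB is cached but unused. One possible reading, offered as an assumption rather than a diagnosis, is that the cached memory is too fragmented to satisfy the request; either way, the GPU is essentially full.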

I checked whether anyone had raised a similar issue, but the closest thing I could find was https://github.com/facebookresearch/fastMRI/issues/44#issuecomment-649439714. The partial solution suggested in https://github.com/facebookresearch/fastMRI/issues/44#issuecomment-649546413 was “decreasing the size of the model - e.g., --num-cascades 4”.

I followed the suggestion, and the model now trains without errors, but the results look bad (see the image below). Is this because of the reduced model size?

[Image: Figure_1_2]

I am training the model on 150 volumes of multi-coil brain data for 50 epochs. I would like to know how to tackle this problem and would appreciate any suggestions or solutions.

Ever since I pulled the project in mid-2020, I have been working with the same versions of the Python libraries listed in the requirements.txt file.

Environment: Python 3, torch 1.5.0, PyTorch Lightning 0.7.6, and torchvision 0.6.0

Desktop (Remote server): OS: Manjaro Linux 64bit / Linux 5.10.2-2-MANJARO Graphics: GeForce RTX 2080 Ti 10GB

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 10 (5 by maintainers)

Top GitHub Comments

1 reaction
adithyaOvGu commented, May 23, 2021

“For exact supported versions, please see the requirements.txt. You can probably use other versions, but we make no guarantees.”

Thanks a lot for the suggestions and support, Mr. Muckley. I am now working on implementing DDP with model parallelism in the script, and I will post the outcome here. Should we consider this issue closed?

1 reaction
mmuckley commented, May 21, 2021

For exact supported versions, please see the requirements.txt. You can probably use other versions, but we make no guarantees.


Top Results From Across the Web

ValueError: when running VarNet · Issue #44 - GitHub
I get the following ValueError when I attempt to run the VarNet. Any idea why? ... RuntimeError: CUDA out of memory.

How to solve this question "RuntimeError: CUDA out of ...
Since your GPU is running out of memory, you can try few things: 1.) Reduce your batch size. 2.) Reduce your network size....

Cuda out of memory error - Intermediate - Hugging Face Forums
I encounter the below error when I finetune my dataset on mbart RuntimeError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU...

Solving "CUDA out of memory" Error - Kaggle
If you try to train multiple models on GPU, you are most likely to encounter some error similar to this one: RuntimeError: CUDA...

Resolving CUDA Being Out of Memory With Gradient ...
Implementing gradient accumulation and automatic mixed precision to solve CUDA out of memory issue when training big deep learning models ...
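
The last result mentions gradient accumulation. As a rough illustration of the idea only, here is a toy sketch in plain Python; it is not fastMRI or PyTorch code, and the one-parameter model and the function names `grad_mse` and `accumulated_grad` are hypothetical. It shows that gradients computed on small micro-batches, weighted by each micro-batch's share of the data, add up to the gradient of the full batch.

```python
def grad_mse(w, xs, ys):
    """Gradient of the mean squared error for the 1-D model y = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Accumulate micro-batch gradients into the full-batch gradient."""
    n = len(xs)
    total = 0.0
    for start in range(0, n, micro_batch_size):
        mx = xs[start:start + micro_batch_size]
        my = ys[start:start + micro_batch_size]
        # Weight each micro-batch mean gradient by its share of the batch.
        total += grad_mse(w, mx, my) * len(mx) / n
    return total


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

full = grad_mse(w, xs, ys)
accum = accumulated_grad(w, xs, ys, micro_batch_size=1)
print(full, accum)
```

Only one micro-batch needs to be held in memory at a time, which is where the saving comes from. Accumulation helps only when the per-step batch can actually be made smaller; if a single sample already exhausts the GPU, it does not reduce peak memory.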
