RuntimeError: CUDA out of memory while training VarNet
Hello,
Since mid-2020, I have been using the fastMRI project, modifying the subsampling scripts to accommodate a custom undersampling pattern and comparing the reconstructions against the variable-density (VarDen) and equispaced undersampled data. Training and testing the UNet model on the remote GPU server has not been a problem; everything works well (as seen in the image below).
During training of the VarNet model, I am encountering the following error after 7-9 iterations of the first epoch:
RuntimeError: CUDA out of memory. Tried to allocate 28.00 MiB (GPU 0; 10.76 GiB total capacity; 9.73 GiB already allocated; 11.76 MiB free; 9.89 GiB reserved in total by PyTorch)
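For reference, the allocator statistics can be checked between iterations with a small helper like the sketch below (plain PyTorch; `log_cuda_memory` is just an illustrative name, not part of fastMRI), to see whether memory grows over iterations or the model simply does not fit on the card:

```python
# Minimal sketch for monitoring GPU memory between training iterations.
# log_cuda_memory is an illustrative helper, not part of the fastMRI codebase.
import torch

def log_cuda_memory(step, device=0):
    allocated = torch.cuda.memory_allocated(device) / 2**30  # tensors currently held
    reserved = torch.cuda.memory_reserved(device) / 2**30    # cached by the allocator
    print(f"step {step}: allocated {allocated:.2f} GiB, reserved {reserved:.2f} GiB")

# e.g., inside the training loop:
# for step, batch in enumerate(train_loader):
#     ...
#     log_cuda_memory(step)
```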
I checked whether anyone had raised a similar issue; the closest I could find was https://github.com/facebookresearch/fastMRI/issues/44#issuecomment-649439714, and the partial solution suggested in https://github.com/facebookresearch/fastMRI/issues/44#issuecomment-649546413 was to decrease the size of the model, e.g. "--num-cascades 4".
I followed that suggestion and the model training runs without any errors, but the results look bad (as seen in the image below). Is this because of the reduced model size?
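For context, one memory/compute trade-off that avoids shrinking the model is gradient checkpointing with `torch.utils.checkpoint`; the sketch below is purely illustrative (the `CheckpointedCascades` wrapper and its signature are placeholders, not fastMRI code):

```python
# Hedged sketch: recompute each cascade's activations during backward instead of
# storing them, trading extra compute for lower peak memory.
# "cascades" stands in for the unrolled VarNet cascades; the signature is illustrative.
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedCascades(nn.Module):
    def __init__(self, cascades):
        super().__init__()
        self.cascades = nn.ModuleList(cascades)

    def forward(self, kspace, mask):
        for cascade in self.cascades:
            # Only the cascade's inputs/outputs are kept in memory; intermediate
            # activations are recomputed in the backward pass.
            kspace = checkpoint(cascade, kspace, mask)
        return kspace
```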
I am training the model on 150 volumes of the multi-coil brain dataset for 50 epochs, and I would like to know how to tackle this problem. I kindly request some suggestions/solutions to overcome this issue.
Ever since I pulled the project in mid-2020, I have been working with the same versions of the Python libraries suggested in the requirements.txt file.
Environment: Python 3, torch 1.5.0, pytorch-lightning 0.7.6, torchvision 0.6.0
Desktop (remote server): OS: Manjaro Linux 64-bit / Linux 5.10.2-2-MANJARO; Graphics: GeForce RTX 2080 Ti 10 GB
Top GitHub Comments
Thanks a lot for the suggestions and support, Mr. Muckley. I am now working on implementing DDP with model parallelism in the script, and I will post the outcome here. Should we consider this issue closed?
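For reference, a naive model-parallel split of the cascades across two GPUs could look roughly like the sketch below (purely illustrative; `TwoGPUCascades` and the device assignment are placeholders, not the fastMRI VarNet API):

```python
# Hedged sketch of naive model parallelism: put half of the cascades on each GPU
# so that no single device has to hold the full network plus its activations.
import torch.nn as nn

class TwoGPUCascades(nn.Module):
    def __init__(self, cascades):
        super().__init__()
        half = len(cascades) // 2
        self.first = nn.Sequential(*cascades[:half]).to("cuda:0")
        self.second = nn.Sequential(*cascades[half:]).to("cuda:1")

    def forward(self, x):
        x = self.first(x.to("cuda:0"))
        x = self.second(x.to("cuda:1"))  # move the intermediate result to the second GPU
        return x
```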
For the exact supported versions, please see requirements.txt. You can probably use other versions, but we make no guarantees.