RuntimeError: CUDA out of memory while training VarNet
Hello,
Since mid-2020, I have been using the fastMRI project, modifying the subsampling scripts to accommodate a custom undersampling pattern and comparing the reconstructions against the variable-density (VarDen) and equispaced undersampled data. Training and testing the UNet model on the remote GPU server has not been a problem; everything works well (as seen in the image below).
During training of the VarNet model, I am encountering the following error after 7-9 iterations of the first epoch:
RuntimeError: CUDA out of memory. Tried to allocate 28.00 MiB (GPU 0; 10.76 GiB total capacity; 9.73 GiB already allocated; 11.76 MiB free; 9.89 GiB reserved in total by PyTorch)
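For reference, the allocator statistics can be checked between iterations with a small helper like the sketch below (plain PyTorch; `log_cuda_memory` is just an illustrative name, not part of fastMRI), to see whether memory grows over iterations or the model simply does not fit on the card:

```python
# Minimal sketch for monitoring GPU memory between training iterations.
# log_cuda_memory is an illustrative helper, not part of the fastMRI codebase.
import torch

def log_cuda_memory(step, device=0):
    allocated = torch.cuda.memory_allocated(device) / 2**30  # tensors currently held
    reserved = torch.cuda.memory_reserved(device) / 2**30    # cached by the allocator
    print(f"step {step}: allocated {allocated:.2f} GiB, reserved {reserved:.2f} GiB")

# e.g., inside the training loop:
# for step, batch in enumerate(train_loader):
#     ...
#     log_cuda_memory(step)
```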
I checked whether anyone had raised a similar issue; the closest I could find was https://github.com/facebookresearch/fastMRI/issues/44#issuecomment-649439714, and the partial solution suggested in https://github.com/facebookresearch/fastMRI/issues/44#issuecomment-649546413 was to decrease the size of the model, e.g. "--num-cascades 4".
I followed that suggestion and the model training runs without any errors, but the results look bad (as seen in the image below). Is this because of the reduced model size?
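For context, one memory/compute trade-off that avoids shrinking the model is gradient checkpointing with `torch.utils.checkpoint`; the sketch below is purely illustrative (the `CheckpointedCascades` wrapper and its signature are placeholders, not fastMRI code):

```python
# Hedged sketch: recompute each cascade's activations during backward instead of
# storing them, trading extra compute for lower peak memory.
# "cascades" stands in for the unrolled VarNet cascades; the signature is illustrative.
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedCascades(nn.Module):
    def __init__(self, cascades):
        super().__init__()
        self.cascades = nn.ModuleList(cascades)

    def forward(self, kspace, mask):
        for cascade in self.cascades:
            # Only the cascade's inputs/outputs are kept in memory; intermediate
            # activations are recomputed in the backward pass.
            kspace = checkpoint(cascade, kspace, mask)
        return kspace
```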
I am training the model on 150 volumes of the multi-coil brain dataset for 50 epochs, and I would like to know how to tackle this problem. I kindly request some suggestions/solutions to overcome this issue.
Ever since I pulled the project in mid-2020, I have been working with the same versions of the Python libraries suggested in the requirements.txt file.
Environment: Python 3, torch 1.5.0, pytorch-lightning 0.7.6, torchvision 0.6.0
Desktop (remote server): OS: Manjaro Linux 64-bit / Linux 5.10.2-2-MANJARO; Graphics: GeForce RTX 2080 Ti 10 GB
Top GitHub Comments
Thanks a lot for the suggestions and support, Mr. Muckley. I am now working on implementing DDP with model parallelism in the script, and I will post the outcome here. Should we consider this issue closed?
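For reference, a naive model-parallel split of the cascades across two GPUs could look roughly like the sketch below (purely illustrative; `TwoGPUCascades` and the device assignment are placeholders, not the fastMRI VarNet API):

```python
# Hedged sketch of naive model parallelism: put half of the cascades on each GPU
# so that no single device has to hold the full network plus its activations.
import torch.nn as nn

class TwoGPUCascades(nn.Module):
    def __init__(self, cascades):
        super().__init__()
        half = len(cascades) // 2
        self.first = nn.Sequential(*cascades[:half]).to("cuda:0")
        self.second = nn.Sequential(*cascades[half:]).to("cuda:1")

    def forward(self, x):
        x = self.first(x.to("cuda:0"))
        x = self.second(x.to("cuda:1"))  # move the intermediate result to the second GPU
        return x
```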
For the exact supported versions, please see requirements.txt. You can probably use other versions, but we make no guarantees.