Upgrading to 0.4.9 causes multi-GPU training to get stuck
Bug description
Hi @vfdev-5,
We upgraded ignite from 0.4.8 to 0.4.9 in MONAI 0.9.1 recently: https://github.com/Project-MONAI/MONAI/pull/4605. We then got this issue report from a user: something changed related to multi-GPU training between 0.9.1 and 0.9.0… monailabel multi-GPU training is not working… SupervisedTrainer is getting stuck on the inference step that computes the loss… after debugging a bit, I see this is the problem: pytorch-ignite==0.4.8 vs pytorch-ignite==0.4.9; when I downgrade it, all is ok…
Environment
- PyTorch Version (e.g., 1.4): 1.12.0
- Ignite Version (e.g., 0.3.0): 0.4.9
- OS (e.g., Linux): ubuntu
- How you installed Ignite (conda, pip, source): pip
- Python version: 3.8
- Any other relevant information: downgrading to 0.4.8 makes everything work again
Here is the snippet that should help to reproduce the problem… Reference: https://github.com/Project-MONAI/tutorials/blob/main/acceleration/distributed_training/unet_training_workflows.py (a rough sketch of the failing pattern follows the attachment labels below).
REQUIREMENTS
CODE
create.py
multi.py
DUMMY DATASET
DOES NOT WORK
WORKS
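The original attachments (create.py, multi.py, the dummy dataset and the exact commands that do and do not work) are not reproduced here. Below is only a minimal sketch, in plain ignite, of the pattern that hangs under 0.4.9 according to the analysis further down: torch DDP initialized manually, plus a Checkpoint/DiskSaver handler attached on rank 0 only. The file name, paths and hyperparameters are made up for illustration.

```python
# repro_sketch.py -- hypothetical minimal reproduction (not the attached MONAI scripts).
# Launch with: torchrun --nproc_per_node=2 repro_sketch.py
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset

from ignite.engine import Events, create_supervised_trainer
from ignite.handlers import Checkpoint, DiskSaver


def main():
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group(backend="nccl", init_method="env://")
    torch.cuda.set_device(local_rank)
    device = torch.device(f"cuda:{local_rank}")

    # Dummy data and model (no DistributedSampler, kept minimal on purpose).
    loader = DataLoader(TensorDataset(torch.randn(64, 10), torch.randn(64, 1)), batch_size=8)
    model = DDP(nn.Linear(10, 1).to(device), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    trainer = create_supervised_trainer(model, optimizer, nn.MSELoss(), device=device)

    # Rank-0-only checkpointing: fine with 0.4.8, hangs with 0.4.9 because
    # DiskSaver makes rank 0 initialize ignite's distributed context (a
    # collective call) that the other ranks never join.
    if dist.get_rank() == 0:
        handler = Checkpoint({"model": model}, DiskSaver("/tmp/ckpts", require_empty=False))
        trainer.add_event_handler(Events.EPOCH_COMPLETED, handler)

    trainer.run(loader, max_epochs=2)
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```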
@SachidanandAlle thanks a lot for the repro code snippet!
This issue is related to the still-open issue https://github.com/pytorch/ignite/issues/2035, and more precisely to https://github.com/pytorch/ignite/issues/2035#issuecomment-855473589.
It is only by chance that 0.4.8 is working: the warning code path mentioned below made ignite detect the DDP context as a side effect.
In 0.4.9 I removed a warning for the DDP context in metrics: https://github.com/pytorch/ignite/pull/2549. As a result, ignite is fully unaware of the DDP context, tries to set it up on rank zero only when using DiskSaver, and thus gets stuck.
@SachidanandAlle, a quick workaround fix for the current code would be:
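The workaround code itself is not shown above; the following is only a hedged guess at what such a fix could look like, based on the explanation that ignite is unaware of the DDP context: make ignite's distributed helpers detect the already-initialized process group on every rank before any rank-0-only handler is created. The call used here, idist.sync(), is an assumption; any idist query issued on all ranks should have the same effect.

```python
# workaround_sketch.py -- hedged guess at the quick workaround, not the actual
# fix posted in the thread.
import torch.distributed as dist

import ignite.distributed as idist

dist.init_process_group(backend="nccl", init_method="env://")

# Executed on every rank: ignite picks up the native DDP context collectively
# here, so a DiskSaver created later on rank 0 only no longer waits alone on a
# collective call.
idist.sync()  # assumption: e.g. idist.get_world_size() on all ranks should also work

# ... build the model/optimizer/trainer as before and attach the rank-0-only
# Checkpoint(DiskSaver(...)) handler ...
```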
By the way, we can simplify the code a bit more by using the ignite.distributed package (which also fixes the issue): multi_updated.py
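The actual multi_updated.py is not included above; the sketch below is an assumption of what such a simplified version could look like, letting idist.Parallel and the idist.auto_* helpers handle process-group setup, device placement and DDP wrapping so that handlers see a consistent distributed context on every rank. Names, dataset and hyperparameters are illustrative only.

```python
# multi_updated_sketch.py -- illustrative sketch, not the actual multi_updated.py.
# Launch with: torchrun --nproc_per_node=2 multi_updated_sketch.py
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset

import ignite.distributed as idist
from ignite.engine import Events, create_supervised_trainer
from ignite.handlers import Checkpoint, DiskSaver


def training(local_rank):
    device = idist.device()

    dataset = TensorDataset(torch.randn(64, 10), torch.randn(64, 1))
    loader = idist.auto_dataloader(dataset, batch_size=8)  # adds a DistributedSampler

    model = idist.auto_model(nn.Linear(10, 1))              # moves to device + DDP wrapping
    optimizer = idist.auto_optim(torch.optim.SGD(model.parameters(), lr=1e-3))

    trainer = create_supervised_trainer(model, optimizer, nn.MSELoss(), device=device)

    # Safe to attach on rank 0 only now: the distributed context was already
    # set up on every rank by idist.Parallel below.
    if idist.get_rank() == 0:
        handler = Checkpoint({"model": model}, DiskSaver("/tmp/ckpts", require_empty=False))
        trainer.add_event_handler(Events.EPOCH_COMPLETED, handler)

    trainer.run(loader, max_epochs=2)


if __name__ == "__main__":
    with idist.Parallel(backend="nccl") as parallel:
        parallel.run(training)
```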
cc @sadra-barikbin and your PR https://github.com/pytorch/ignite/pull/2633