Training stalls with DDP multi-GPU setup
🐛 Bug
My training/validation step hangs when using DDP on a 4-GPU AWS instance. Usually it happens at the end of the first epoch, but sometimes in the middle of it. The code runs fine on 1 GPU. My model checkpoint is a very basic setup:
```python
checkpoint_callback = pl.callbacks.ModelCheckpoint(
    args.checkpointdir,
    save_last=True)
```
as is the trainer:
```python
trainer = pl.Trainer(
    progress_bar_refresh_rate=1000,
    log_every_n_steps=1000,
    max_epochs=model_config['epochs'],
    gradient_clip_val=0.5,
    gpus=-1,
    accelerator='ddp',
    plugins=[pl.plugins.DDPPlugin(find_unused_parameters=True)],
    callbacks=[checkpoint_callback])
```
I know there is a related issue https://github.com/PyTorchLightning/pytorch-lightning/issues/4612, but in my case the hanging happens non-deterministically.
Funnily enough, if I use a subset of the data via --limit_train_batches, the training runs fine. I also monitor GPU memory usage and it never goes above 91/92%, so it does not look like an out-of-memory problem.
Any suggestions would be most appreciated.
Is there at least a way to induce an error message and a hard failure? For example, on AWS SageMaker a stalled training job does not fail, so it keeps accumulating costs. I do not want to use other parallel backends, as they are much slower and make 4-GPU parallelism cost-ineffective.
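One way to turn a silent stall into a hard failure is to make NCCL collectives respect the process-group timeout instead of blocking forever. The exact knobs depend on the PyTorch version (`NCCL_ASYNC_ERROR_HANDLING` only exists from PyTorch 1.7), so treat this as a sketch rather than a guaranteed fix:

```python
import os
from datetime import timedelta

import torch.distributed as dist

# With blocking wait enabled, a collective that exceeds the process-group
# timeout raises an error instead of hanging the job indefinitely.
os.environ["NCCL_BLOCKING_WAIT"] = "1"
os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"  # PyTorch >= 1.7 only

# Only needed if you initialise the process group yourself (Lightning's DDP
# normally does this for you); assumes MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE
# are set by the launcher. The default timeout is 30 minutes.
dist.init_process_group(backend="nccl", timeout=timedelta(minutes=10))
```

With the environment variables set, a rank stuck in a collective will error out after the timeout, which at least fails the SageMaker job instead of letting it run up costs.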
Expected behavior
Model runs in multi-GPU DDP mode without stalling.
Environment
Using AWS p3* instances
- CUDA:
  - GPU: Tesla V100-SXM2-16GB
  - available: True
  - version: 10.1
- Packages:
  - numpy: 1.20.1
  - pyTorch_debug: False
  - pyTorch_version: 1.4.0 (also tried 1.6.0)
  - pytorch-lightning: 1.2.3
  - tqdm: 4.57.0
- System:
  - OS: Linux
  - architecture: 64bit, ELF
  - processor: x86_64
  - python: 3.8.5
  - version: #40-Ubuntu SMP Fri Feb 5 23:50:40 UTC 2021
I’ve also faced a problem with code stuck on “initializing ddp”. After much work, I solved my problem by simply adding
num_sanity_val_steps=0
as an argument toTrainer(...)
Same issue, it doesn’t even start for me.
This is my hardware (2 GPUs on one machine):
My code (the MNIST example from this tutorial):
However, when running this in a Jupyter notebook, all I get is:
UPDATE: the following work:
But