Checkpoint fails in single node multi-GPU mode using DDP
🐛 Bug
Checkpoint fails in single node multi-GPU mode using DDP.
To Reproduce
python pl_examples/basic_examples/gpu_template.py --distributed_backend ddp --gpus 2
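For context, the failing example boils down to a Trainer driving a LightningModule with the default checkpoint callback under ddp on two GPUs. A rough standalone sketch of that setup (hypothetical code, assuming the Trainer/ModelCheckpoint API of the Lightning version shown in the traceback; names such as distributed_backend and validation_epoch_end may differ between releases, and it requires a machine with two GPUs):

import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

class TinyModel(pl.LightningModule):
    # Hypothetical minimal module, standing in for the template model used by gpu_template.py.
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return {'loss': F.cross_entropy(self(x), y)}

    def validation_step(self, batch, batch_idx):
        x, y = batch
        return {'val_loss': F.cross_entropy(self(x), y)}

    def validation_epoch_end(self, outputs):
        # Report val_loss so the default ModelCheckpoint callback has a metric to monitor.
        return {'val_loss': torch.stack([o['val_loss'] for o in outputs]).mean()}

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)

    def train_dataloader(self):
        return DataLoader(TensorDataset(torch.randn(256, 32), torch.randint(0, 2, (256,))), batch_size=32)

    def val_dataloader(self):
        return DataLoader(TensorDataset(torch.randn(64, 32), torch.randint(0, 2, (64,))), batch_size=32)

if __name__ == '__main__':
    # gpus=2 and distributed_backend='ddp' mirror the command line above.
    trainer = pl.Trainer(gpus=2, distributed_backend='ddp', max_epochs=3)
    trainer.fit(TinyModel())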
Epoch 2: : 700it [00:28, 42.69it/s, l
/home/xz/anaconda3/envs/x/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown
  len(cache))
Traceback (most recent call last):
  File "gpu_template.py", line 79, in <module>
    main(hyperparams)
  File "gpu_template.py", line 40, in main
    trainer.fit(model)
  File "/home/xz/anaconda3/envs/x/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 590, in fit
    mp.spawn(self.ddp_train, nprocs=self.num_gpus, args=(model,))
  File "/home/xz/anaconda3/envs/x/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/home/xz/anaconda3/envs/x/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/xz/anaconda3/envs/x/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/xz/anaconda3/envs/x/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 342, in ddp_train
    self.run_pretrain_routine(model)
  File "/home/xz/anaconda3/envs/x/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 830, in run_pretrain_routine
    self.train()
  File "/home/xz/anaconda3/envs/x/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 343, in train
    self.run_training_epoch()
  File "/home/xz/anaconda3/envs/x/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 452, in run_training_epoch
    self.call_checkpoint_callback()
  File "/home/xz/anaconda3/envs/x/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 737, in call_checkpoint_callback
    self.checkpoint_callback.on_validation_end(self, self.get_model())
  File "/home/xz/anaconda3/envs/x/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 204, in on_validation_end
    self._do_check_save(filepath, current, epoch)
  File "/home/xz/anaconda3/envs/x/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 221, in _do_check_save
    self._del_model(delpath)
  File "/home/xz/anaconda3/envs/x/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 121, in _del_model
    os.remove(filepath)
FileNotFoundError: [Errno 2] No such file or directory: '/home/xz/pytorch-lightning/pl_examples/basic_examples/lightning_logs/version_1/checkpoints/epoch=0.ckpt'
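The last two frames are the relevant ones: ModelCheckpoint._do_check_save asks _del_model to delete an older checkpoint, and os.remove finds the file already gone. One plausible reading (an assumption on my part, not something confirmed in this thread) is that in ddp mode both spawned processes run the checkpoint callback, so a non-rank-zero process can try to delete a file that another process has already removed or never wrote. A minimal defensive sketch of a workaround along those lines (a hypothetical helper, not the library's actual fix):

import os

def safe_del_checkpoint(filepath, global_rank):
    # Hypothetical guard: let only rank 0 manage checkpoint files, and
    # tolerate a file that has already been removed by another process.
    if global_rank != 0:
        return
    try:
        os.remove(filepath)
    except FileNotFoundError:
        pass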
Issue Analytics
- State:
- Created: 4 years ago
- Comments: 5 (3 by maintainers)
The fix for the DDP checkpoint issue is in #1125; it's still waiting to be reviewed and merged.
As for this issue, it seems to work fine on my side. Can you double-check?
Is the fix in master?
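One generic way to answer that, without assuming which release the fix lands in, is to compare the installed version against the project's changelog:

import pytorch_lightning as pl

# Print the installed version; compare it against the CHANGELOG entry for
# the DDP checkpoint fix (or install from the master branch to test it).
print(pl.__version__)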