
Checkpoint fails in single node multi-GPU mode using DDP

See original GitHub issue

🐛 Bug

Checkpoint fails in single node multi-GPU mode using DDP.

To Reproduce

python pl_examples/basic_examples/gpu_template.py --distributed_backend ddp --gpus 2

Epoch 2: : 700it [00:28, 42.69it/s, l
/home/xz/anaconda3/envs/x/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown
  len(cache))
Traceback (most recent call last):
  File "gpu_template.py", line 79, in <module>
    main(hyperparams)
  File "gpu_template.py", line 40, in main
    trainer.fit(model)
  File "/home/xz/anaconda3/envs/x/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 590, in fit
    mp.spawn(self.ddp_train, nprocs=self.num_gpus, args=(model,))
  File "/home/xz/anaconda3/envs/x/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/home/xz/anaconda3/envs/x/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/xz/anaconda3/envs/x/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/xz/anaconda3/envs/x/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 342, in ddp_train
    self.run_pretrain_routine(model)
  File "/home/xz/anaconda3/envs/x/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 830, in run_pretrain_routine
    self.train()
  File "/home/xz/anaconda3/envs/x/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 343, in train
    self.run_training_epoch()
  File "/home/xz/anaconda3/envs/x/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 452, in run_training_epoch
    self.call_checkpoint_callback()
  File "/home/xz/anaconda3/envs/x/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 737, in call_checkpoint_callback
    self.checkpoint_callback.on_validation_end(self, self.get_model())
  File "/home/xz/anaconda3/envs/x/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 204, in on_validation_end
    self._do_check_save(filepath, current, epoch)
  File "/home/xz/anaconda3/envs/x/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 221, in _do_check_save
    self._del_model(delpath)
  File "/home/xz/anaconda3/envs/x/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 121, in _del_model
    os.remove(filepath)
FileNotFoundError: [Errno 2] No such file or directory: '/home/xz/pytorch-lightning/pl_examples/basic_examples/lightning_logs/version_1/checkpoints/epoch=0.ckpt'
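The traceback bottoms out in os.remove on a checkpoint path that no longer exists. Under the ddp backend every spawned process runs the ModelCheckpoint callback, so when top-k pruning kicks in, a second process can try to delete a file its sibling has already removed. The sketch below simulates that race with a tolerant deletion helper; it is illustrative only, not the upstream fix:

```python
import os
import tempfile

def safe_remove(filepath):
    """Delete a checkpoint, tolerating a sibling DDP process
    having already removed the same file."""
    try:
        os.remove(filepath)
        return True
    except FileNotFoundError:
        # Another process won the race; nothing left to delete.
        return False

# Simulate two processes pruning the same checkpoint file.
ckpt = os.path.join(tempfile.mkdtemp(), "epoch=0.ckpt")
open(ckpt, "w").close()
print(safe_remove(ckpt))  # True: first deleter wins
print(safe_remove(ckpt))  # False: second deleter no longer raises
```

The second call is exactly where the unpatched _del_model crashes: it calls os.remove unconditionally instead of tolerating the missing file.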

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
ghost commented, Mar 12, 2020

The fix for the DDP checkpoint issue is in #1125; it is still waiting to be reviewed and merged.

I believe this happens with multiple GPUs as well. And it only seems to happen if ModelCheckpoint(save_top_k) is set greater than 1. Still converting models to 0.7.1, but wanted to share this …
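The save_top_k > 1 observation fits the race above: pruning old checkpoints only happens when more than one is kept, and every DDP rank runs the pruning. A common interim mitigation is to restrict checkpoint bookkeeping to the rank-0 process. Lightning ships a rank_zero_only helper for this; the sketch below is a hand-rolled stand-in, and reading the RANK environment variable is an assumption about how the launcher identifies processes:

```python
import functools
import os

def rank_zero_only(fn):
    """Run fn only in the process whose RANK env var is 0 (or unset)."""
    @functools.wraps(fn)
    def wrapped(*args, **kwargs):
        if int(os.environ.get("RANK", "0")) == 0:
            return fn(*args, **kwargs)
        return None  # non-zero ranks skip the filesystem work
    return wrapped

@rank_zero_only
def delete_checkpoint(path):
    # Hypothetical cleanup step; only rank 0 ever touches the file.
    os.remove(path)

os.environ["RANK"] = "1"
delete_checkpoint("/nonexistent/epoch=0.ckpt")  # no-op on rank 1, no crash
```

With only one rank mutating the checkpoint directory, the delete-after-delete race cannot occur.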

As for this issue, it seems to work fine on my side. Can you double-check?

0 reactions
sneiman commented, Mar 12, 2020

Is the fix in master?

Read more comments on GitHub >

Top Results From Across the Web

Checkpoint fails in single node multi-GPU mode using DDP
Bug Checkpoint fails in single node multi-GPU mode using DDP. To Reproduce python pl_examples/basic_examples/gpu_template.py ...
Read more >
Multi GPU training with DDP
In this tutorial, we start with a single-GPU training script and migrate that to running it on 4 GPUs on a single node....
Read more >
Efficient Training on Multiple GPUs
When training on a single GPU is too slow or the model weights don't fit in a single GPUs memory we use a...
Read more >
Graphics Processing Unit (GPU) - PyTorch Lightning
If you request multiple GPUs or nodes without setting a mode, DDP Spawn will be automatically used. For a deeper understanding of what...
Read more >
PyTorch - CC Doc
Tensor computation (like NumPy) with strong GPU acceleration ... maintainers to use multiple GPUs, whether they are all on a single node, ...
Read more >
