gan.py multi-gpu running problems


Running the gan.py example with Trainer(ngpus=2) causes two types of error:

  1. With Trainer(ngpus=2, distributed_backend='dp'):
Exception has occurred: AttributeError
'NoneType' object has no attribute 'detach'
  File "/home/user/gan.py", line 146, in training_step
    self.discriminator(self.generated_imgs.detach()), fake)
  2. With Trainer(ngpus=2, distributed_backend='ddp'):
  • in ./lightning_logs, one run creates two folders: version_0 and version_1
  • the following exception is raised:
    File "/opt/miniconda3/envs/ctln-gan/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 122, in _del_model
      os.remove(filepath)
    FileNotFoundError: [Errno 2] No such file or directory: '/home/user/pyproj/DCGAN/lightning_logs/version_1/checkpoints/epoch=0.ckpt'

It seems that each subprocess tries to create its own checkpoints and then delete one that was never created.
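If the diagnosis above is right, one common way to avoid this kind of race is to let only the rank-0 process touch checkpoint files, and to tolerate a file that is already gone. The sketch below illustrates that idea in plain Python. The helper name `remove_checkpoint` and its `rank` argument are hypothetical and are not part of pytorch-lightning; how the rank is obtained depends on the launcher and is assumed to be known to the caller.

```python
import os


def remove_checkpoint(filepath: str, rank: int) -> bool:
    """Delete a checkpoint from rank 0 only; return True if a file was removed.

    Hypothetical helper for illustration, not a pytorch-lightning API.
    """
    if rank != 0:
        # Non-zero ranks never modify the checkpoint directory.
        return False
    try:
        os.remove(filepath)
    except FileNotFoundError:
        # The file was never written or was already removed; nothing to do.
        return False
    return True
```

With a guard like this, a second process calling the helper for `version_1/checkpoints/epoch=0.ckpt` would return `False` instead of raising `FileNotFoundError`.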

Environment version:

python 3.7.5
pytorch 1.4.0
pytorch-lightning 0.7.1
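
A possible explanation for the first error (dp backend): the PyTorch documentation for `torch.nn.DataParallel` notes that the module is replicated on each forward call and that updates made to a replica are lost. Assuming `training_step` runs on such a replica, `self.generated_imgs` set during the generator step would not be visible during the discriminator step. The toy below only simulates that behaviour with `copy.copy`; `ToyGAN` and `run_on_replica` are hypothetical stand-ins, not the real gan.py or Lightning code.

```python
import copy


class ToyGAN:
    """Stand-in for the GAN module; not the real gan.py class."""

    def __init__(self):
        self.generated_imgs = None

    def training_step(self, optimizer_idx):
        if optimizer_idx == 0:
            # Generator step caches its output on whichever object runs it.
            self.generated_imgs = [0.1, 0.2, 0.3]
            return "generator step"
        # Discriminator step reads the cache written by the generator step.
        if self.generated_imgs is None:
            raise AttributeError("'NoneType' object has no attribute 'detach'")
        return "discriminator step"


def run_on_replica(module, optimizer_idx):
    """Mimic a per-call replica: work on a shallow copy, then discard it."""
    replica = copy.copy(module)
    return replica.training_step(optimizer_idx)


model = ToyGAN()
run_on_replica(model, optimizer_idx=0)
print(model.generated_imgs)  # None: the attribute was set on the discarded replica
```

Under this assumption, a workaround would be to regenerate the fake images inside the discriminator branch of `training_step` rather than reading an attribute cached by the generator branch.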

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Reactions: 4
  • Comments: 9 (8 by maintainers)

Top GitHub Comments

3 reactions
lobantseff commented, Apr 22, 2020

@Borda I expect to fix it by May

2 reactions
axkoenig commented, May 28, 2020

@armavox Any updates on this? Having the same issue…


