gan.py multi-gpu running problems
Running the gan.py example with Trainer(ngpus=2) causes two types of errors:
- if Trainer(ngpus=2, distributed_backend='dp'):
  Exception has occurred: AttributeError
  'NoneType' object has no attribute 'detach'
    File "/home/user/gan.py", line 146, in training_step
      self.discriminator(self.generated_imgs.detach()), fake)
- if Trainer(ngpus=2, distributed_backend='ddp'):
  - in ./lightning_logs, one run creates two folders: version_0 and version_1
  - Exception caused:
    File "/opt/miniconda3/envs/ctln-gan/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 122, in _del_model
      os.remove(filepath)
    FileNotFoundError: [Errno 2] No such file or directory: '/home/user/pyproj/DCGAN/lightning_logs/version_1/checkpoints/epoch=0.ckpt'
  It seems that each subprocess tries to create its own checkpoints and then to delete one that was never created.
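The ddp symptom above (every subprocess touching checkpoints) can be sketched as a guard around the deletion. This is a minimal illustration of the idea, not the library's actual fix: the helper names `is_rank_zero` and `remove_checkpoint` are hypothetical, and reading the rank from a `LOCAL_RANK` environment variable is an assumption about how the processes are launched.

```python
import os


def is_rank_zero() -> bool:
    # Assumption: the launcher exports the process rank as LOCAL_RANK.
    # An unset variable is treated as rank 0 (single-process run).
    return int(os.environ.get("LOCAL_RANK", "0")) == 0


def remove_checkpoint(filepath: str) -> bool:
    """Delete a checkpoint file only from rank 0.

    Returns True if a file was removed, False otherwise. A missing file
    is tolerated instead of raising FileNotFoundError.
    """
    if not is_rank_zero():
        # Other ranks never wrote this file, so they must not delete it.
        return False
    try:
        os.remove(filepath)
        return True
    except FileNotFoundError:
        return False
```

With a guard like this, only one process manages the checkpoint files, so a second process would not fail on a path under version_1 that it never wrote.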
Environment version:
python 3.7.5
pytorch 1.4.0
pytorch-lightning 0.7.1
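For the dp case, one plausible explanation (not confirmed in this thread) is that each step runs on a temporary replica of the module, so an attribute such as self.generated_imgs set during the generator step is never stored on the original module. The torch-free sketch below imitates that with `copy.copy`; `ToyGAN` and `run_on_replica` are hypothetical stand-ins, not code from gan.py or pytorch-lightning.

```python
import copy


class ToyGAN:
    """Stand-in for the module in gan.py: no networks, only the caching pattern."""

    def __init__(self):
        self.generated_imgs = None  # cache meant to be filled by the generator step

    def generator_step(self, z):
        # Pretend generator output, cached on whichever object runs the step.
        self.generated_imgs = [2 * x for x in z]
        return sum(self.generated_imgs)

    def discriminator_step_cached(self):
        # Mirrors the failing line: depends on state left by an earlier step.
        return len(self.generated_imgs.copy())

    def discriminator_step_regenerate(self, z):
        # Workaround sketch: recompute the fake batch inside this step.
        fake = [2 * x for x in z]
        return len(fake)


def run_on_replica(module, method, *args):
    # Assumption: every step executes on a throwaway shallow copy,
    # imitating per-call replication across devices.
    replica = copy.copy(module)
    return getattr(replica, method)(*args)


if __name__ == "__main__":
    model = ToyGAN()
    run_on_replica(model, "generator_step", [1, 2, 3])
    print(model.generated_imgs)  # the original module's cache is still None
    try:
        run_on_replica(model, "discriminator_step_cached")
    except AttributeError as err:
        print("cached step failed:", err)
    print(run_on_replica(model, "discriminator_step_regenerate", [1, 2, 3]))
```

If this is indeed the cause, regenerating the fake images inside the discriminator branch of training_step avoids relying on state shared between steps, at the cost of one extra generator forward pass.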
@Borda I assume this will be fixed by May.
@armavox Any updates on this? Having the same issue…