Strange behavior using PyTorch DDP
See original GitHub issue
@1ytic Hi,
So far I have been able to use the loss with DDP on a single GPU, and it behaves more or less as expected.
But when I use more than one device, the following happens:
- On GPU-0, the loss is calculated properly
- On GPU-1, the loss is close to zero for each batch
I checked the input tensors, devices, tensor values, etc. - so far everything seems to be identical for GPU-0 and the other GPUs.
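One way to narrow this down is to feed an identical, fixed batch to every rank and compare the resulting loss values; if they still diverge, the data pipeline can be ruled out and the problem is likely in the loss computation on non-zero devices. The sketch below is not the reporter's code: `loss_fn`, the RNN-T-style tensor shapes, and the `torchrun` launch are assumptions.

```python
# Minimal per-rank diagnostic sketch (placeholder code, not from the issue).
# Launch with: torchrun --nproc_per_node=2 check_loss.py
import torch
import torch.distributed as dist

def check_loss_across_ranks(loss_fn):
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    device = torch.device("cuda", rank)
    torch.cuda.set_device(device)

    # Identical dummy inputs on every rank: seeded on CPU, then moved to the GPU.
    # Shapes follow an RNN-T layout: log_probs (N, T, U+1, V), labels (N, U).
    torch.manual_seed(0)
    log_probs = torch.randn(4, 50, 20, 32).log_softmax(dim=-1).to(device)
    labels = torch.randint(1, 32, (4, 19), dtype=torch.int32).to(device)
    frame_lens = torch.full((4,), 50, dtype=torch.int32).to(device)
    label_lens = torch.full((4,), 19, dtype=torch.int32).to(device)

    loss = loss_fn(log_probs, labels, frame_lens, label_lens).mean()

    # Gather every rank's loss value and print them side by side on rank 0.
    value = loss.detach().reshape(1)
    gathered = [torch.zeros_like(value) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, value)
    if rank == 0:
        print("per-rank losses:", [g.item() for g in gathered])

    dist.destroy_process_group()
```

If GPU-0 and GPU-1 report different values for the same inputs, the discrepancy is in the loss itself rather than in DDP or the sampler.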
Issue Analytics
- State:
- Created 2 years ago
- Comments: 7
Top Results From Across the Web
- Weird behavior while evaluating using DDP - PyTorch Forums
  I am using DDP while training (4 V100 GPUs) and a distributed sampler for training, and while testing set the sampler to...
- Strange behavior in Pytorch
  I saw a very strange behavior during training. ... DDP processes go into D status (disk sleep, uninterruptible). The training procedure gets stuck.
- DistributedDataParallel behaves weirdly - PyTorch Forums
  I tested a couple of hyperparameters and found weird behavior, which left me wondering if I overlooked something.
- Weird behavior when dealing with uneven ... - PyTorch Forums
  Background: Hi, I was trying to reproduce the tutorial given by https://tutorials.pytorch.kr/advanced/generic_join.html#what-is-join, ...
- How to fix randomness of dataloader in DDP? - PyTorch Forums
  I'm using DDP and I hope that my data loader can generate precisely ... to get deterministic shuffling behavior. ... It's...
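The last result above touches on deterministic shuffling under DDP. The usual recipe with the standard `torch.utils.data` APIs is to give `DistributedSampler` a fixed seed and call `set_epoch()` every epoch; the snippet below is only a sketch with a placeholder dataset and batch size.

```python
# Sketch: deterministic per-epoch shuffling with DistributedSampler.
# Assumes torch.distributed is already initialized (e.g. via torchrun);
# the dataset and batch size are placeholders.
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(1000).float())
sampler = DistributedSampler(dataset, shuffle=True, seed=42)  # same seed on every rank
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for epoch in range(3):
    sampler.set_epoch(epoch)  # reshuffle consistently across ranks, differently per epoch
    for (batch,) in loader:
        pass  # training step goes here
```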
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
@snakers4 You may find https://github.com/danpovey/fast_rnnt useful.
Yes, this means that the logits / target lengths tensors do not match the logits / target tensors. For instance, this happens if your logits lengths are longer than the time dimension of your logits tensor.
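A quick way to catch this kind of mismatch before it silently produces bad losses is to validate the length tensors against the tensor shapes right before calling the loss. The check below is only a sketch: it assumes an RNN-T-style layout (`log_probs` of shape `(N, T, U+1, V)`, `labels` of shape `(N, U)`), and the function name is hypothetical.

```python
# Hypothetical pre-flight checks; adapt the names to your own call site.
def validate_rnnt_inputs(log_probs, labels, frame_lengths, label_lengths):
    N, T, U_plus_1, V = log_probs.shape
    assert labels.shape == (N, U_plus_1 - 1), \
        "labels must be (N, U) with U = log_probs.size(2) - 1"
    assert frame_lengths.shape == (N,) and label_lengths.shape == (N,), \
        "length tensors must be 1-D with one entry per batch element"
    assert int(frame_lengths.max()) <= T, \
        "a frame length exceeds the time dimension of log_probs"
    assert int(label_lengths.max()) <= labels.shape[1], \
        "a label length exceeds the label dimension of labels"
```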