
Strange behavior using PyTorch DDP

See original GitHub issue

@1ytic Hi,

So far I have been able to use the loss with DDP on a single GPU, and it behaves more or less as expected.

But when I use more than 1 device, the following happens:

  • On GPU-0 loss is calculated properly
  • On GPU-1 loss is close to zero for each batch

I checked the input tensors, devices, tensor values, etc.; so far everything seems to be identical between GPU-0 and the other GPUs.
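
One way to make such a comparison concrete is to print per-rank summaries and gather the losses on every rank. A minimal sketch, assuming `torch.distributed` is already initialized with one process per GPU; `debug_loss_across_ranks` is a hypothetical helper, not part of any library:

```python
import torch
import torch.distributed as dist

def debug_loss_across_ranks(loss, logits, targets):
    """Hypothetical helper: print per-rank stats so ranks can be compared.

    Assumes an initialized process group (e.g. NCCL) with one process per GPU.
    """
    rank = dist.get_rank()
    # If the inputs really are identical across ranks, these summaries
    # should print identical numbers on every rank.
    print(
        f"rank={rank} loss={loss.item():.6f} "
        f"logits shape={tuple(logits.shape)} mean={logits.float().mean().item():.6f} "
        f"targets shape={tuple(targets.shape)} sum={int(targets.sum())}"
    )
    # Gather the scalar losses so rank 0 can compare them directly.
    losses = [torch.zeros_like(loss) for _ in range(dist.get_world_size())]
    dist.all_gather(losses, loss.detach())
    if rank == 0:
        print("losses across ranks:", [l.item() for l in losses])
```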

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 7

Top GitHub Comments

1 reaction
csukuangfj commented, May 14, 2022

Thanks for the heads up about the torchaudio loss!

@snakers4 You may find https://github.com/danpovey/fast_rnnt useful.
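
For reference, torchaudio also ships an RNN-T loss with a similar interface. A minimal sketch of `torchaudio.functional.rnnt_loss` on toy shapes (the dimensions here are made up for illustration):

```python
import torch
import torchaudio

# Toy shapes: batch=2, time=10, max target length=5, vocab=20, blank=0.
logits = torch.randn(2, 10, 6, 20, requires_grad=True)    # (B, T, U+1, V)
targets = torch.randint(1, 20, (2, 5), dtype=torch.int32)  # (B, U), no blanks
logit_lengths = torch.tensor([10, 10], dtype=torch.int32)
target_lengths = torch.tensor([5, 5], dtype=torch.int32)

loss = torchaudio.functional.rnnt_loss(
    logits, targets, logit_lengths, target_lengths,
    blank=0, reduction="mean",
)
loss.backward()
```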

1 reaction
burchim commented, Jan 13, 2022

Yes, this means that the logits/target length tensors do not match the logits/target tensors themselves, for instance if a logit length is greater than the time dimension of your logits tensor.
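
A quick sanity check along these lines, assuming `(B, T, U+1, V)` logits and `(B, U)` targets as in torchaudio-style RNN-T losses; `check_rnnt_inputs` is a hypothetical helper:

```python
def check_rnnt_inputs(logits, targets, logit_lengths, target_lengths):
    """Hypothetical helper: assert length tensors match the padded tensors."""
    B, T, U_plus_1, _ = logits.shape
    assert targets.shape[0] == B
    assert logit_lengths.shape == (B,) and target_lengths.shape == (B,)
    # A length larger than the padded dimension is exactly the mismatch
    # described above; it can read past valid data and corrupt the loss.
    assert int(logit_lengths.max()) <= T, "logit length exceeds logits time dim"
    assert int(target_lengths.max()) <= targets.shape[1], "target length exceeds targets dim"
    assert int(target_lengths.max()) + 1 <= U_plus_1, "target length inconsistent with logits U dim"
```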

Read more comments on GitHub >

Top Results From Across the Web

Weird behavior while evaluating using DDP - PyTorch Forums
I am using DDP while training (4 v100 GPUs), and using distributed sampler for the training and while testing set the sampler to...
Read more >
Strange behavior in Pytorch
I saw a very strange behavior during training. ... DDP processes go into D status (disk sleep (uninterruptible)). Training procedure stuck.
Read more >
DistributedDataParallel behaves weirdly - PyTorch Forums
I tested a couple of hyperparameters and found weird behavior, which left me wondering if I overlooked something.
Read more >
Weird behavior when dealing with uneven ... - PyTorch Forums
Background Hi, I was trying to reproduce the tutorial given by https://tutorials.pytorch.kr/advanced/generic_join.html#what-is-join, ...
Read more >
How to fix randomness of dataloader in DDP? - PyTorch Forums
I'm using DDP and I hope that my data loader can generate precisely ... to get the get deterministic shuffling behavior. ... It's...
Read more >
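
The last result above concerns deterministic shuffling under DDP. The usual pattern is to call `DistributedSampler.set_epoch` at the start of each epoch; a minimal single-process sketch (`num_replicas` and `rank` are passed explicitly here so it runs standalone, but under real DDP they are read from the process group):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(100).float())
# Under DDP, omit num_replicas/rank; they come from the process group.
sampler = DistributedSampler(dataset, num_replicas=1, rank=0, shuffle=True, seed=42)
loader = DataLoader(dataset, batch_size=8, sampler=sampler)

for epoch in range(3):
    # Without this call, every epoch reuses the epoch-0 shuffle order.
    # With it, shuffling is deterministic given the seed but differs per epoch.
    sampler.set_epoch(epoch)
    for (batch,) in loader:
        pass  # training step goes here
```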
