Strange behavior using PyTorch DDP
See original GitHub issue
@1ytic Hi,
So far I have been able to use the loss with DDP on a single GPU, and it behaves more or less as expected.
But when I use more than one device, the following happens:
- On GPU-0, the loss is calculated properly
- On GPU-1, the loss is close to zero for each batch
I checked the input tensors, devices, tensor values, etc. - so far everything seems to be identical for GPU-0 and the other GPUs.
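One way to narrow this down is to feed an identical, fixed batch to every rank and compare the resulting loss values; if they still diverge, the data pipeline can be ruled out and the problem is likely in the loss computation on non-zero devices. The sketch below is not the reporter's code: `loss_fn`, the RNN-T-style tensor shapes, and the `torchrun` launch are assumptions.

```python
# Minimal per-rank diagnostic sketch (placeholder code, not from the issue).
# Launch with: torchrun --nproc_per_node=2 check_loss.py
import torch
import torch.distributed as dist

def check_loss_across_ranks(loss_fn):
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    device = torch.device("cuda", rank)
    torch.cuda.set_device(device)

    # Identical dummy inputs on every rank: seeded on CPU, then moved to the GPU.
    # Shapes follow an RNN-T layout: log_probs (N, T, U+1, V), labels (N, U).
    torch.manual_seed(0)
    log_probs = torch.randn(4, 50, 20, 32).log_softmax(dim=-1).to(device)
    labels = torch.randint(1, 32, (4, 19), dtype=torch.int32).to(device)
    frame_lens = torch.full((4,), 50, dtype=torch.int32).to(device)
    label_lens = torch.full((4,), 19, dtype=torch.int32).to(device)

    loss = loss_fn(log_probs, labels, frame_lens, label_lens).mean()

    # Gather every rank's loss value and print them side by side on rank 0.
    value = loss.detach().reshape(1)
    gathered = [torch.zeros_like(value) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, value)
    if rank == 0:
        print("per-rank losses:", [g.item() for g in gathered])

    dist.destroy_process_group()
```

If GPU-0 and GPU-1 report different values for the same inputs, the discrepancy is in the loss itself rather than in DDP or the sampler.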
Issue Analytics
- State:
- Created 2 years ago
- Comments: 7
Top Results From Across the Web
- Weird behavior while evaluating using DDP - PyTorch Forums
  I am using DDP while training (4 V100 GPUs) and a distributed sampler for training, and while testing set the sampler to...
- Strange behavior in Pytorch
  I saw a very strange behavior during training. ... DDP processes go into D status (disk sleep, uninterruptible). The training procedure gets stuck.
- DistributedDataParallel behaves weirdly - PyTorch Forums
  I tested a couple of hyperparameters and found weird behavior, which left me wondering if I overlooked something.
- Weird behavior when dealing with uneven ... - PyTorch Forums
  Background: Hi, I was trying to reproduce the tutorial given by https://tutorials.pytorch.kr/advanced/generic_join.html#what-is-join, ...
- How to fix randomness of dataloader in DDP? - PyTorch Forums
  I'm using DDP and I hope that my data loader can generate precisely ... to get deterministic shuffling behavior. ... It's...
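The last result above touches on deterministic shuffling under DDP. The usual recipe with the standard `torch.utils.data` APIs is to give `DistributedSampler` a fixed seed and call `set_epoch()` every epoch; the snippet below is only a sketch with a placeholder dataset and batch size.

```python
# Sketch: deterministic per-epoch shuffling with DistributedSampler.
# Assumes torch.distributed is already initialized (e.g. via torchrun);
# the dataset and batch size are placeholders.
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(1000).float())
sampler = DistributedSampler(dataset, shuffle=True, seed=42)  # same seed on every rank
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for epoch in range(3):
    sampler.set_epoch(epoch)  # reshuffle consistently across ranks, differently per epoch
    for (batch,) in loader:
        pass  # training step goes here
```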
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
@snakers4 You may find https://github.com/danpovey/fast_rnnt useful.
Yes, this means that the logits / target lengths tensors do not match the logits / target tensors. For instance, this happens if your logits lengths are longer than the time dimension of your logits tensor.
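A quick way to catch this kind of mismatch before it silently produces bad losses is to validate the length tensors against the tensor shapes right before calling the loss. The check below is only a sketch: it assumes an RNN-T-style layout (`log_probs` of shape `(N, T, U+1, V)`, `labels` of shape `(N, U)`), and the function name is hypothetical.

```python
# Hypothetical pre-flight checks; adapt the names to your own call site.
def validate_rnnt_inputs(log_probs, labels, frame_lengths, label_lengths):
    N, T, U_plus_1, V = log_probs.shape
    assert labels.shape == (N, U_plus_1 - 1), \
        "labels must be (N, U) with U = log_probs.size(2) - 1"
    assert frame_lengths.shape == (N,) and label_lengths.shape == (N,), \
        "length tensors must be 1-D with one entry per batch element"
    assert int(frame_lengths.max()) <= T, \
        "a frame length exceeds the time dimension of log_probs"
    assert int(label_lengths.max()) <= labels.shape[1], \
        "a label length exceeds the label dimension of labels"
```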