
When using multiple GPUs, `loss.mean()` may have a subtle bias

See original GitHub issue

The problem is that when the input batch is scattered across multiple GPUs, each GPU may receive a different number of examples.

For example, with 2 GPUs and a total batch size of 13, the per-GPU batch sizes will be 7 and 6. `loss.mean()` then averages the two per-GPU mean losses with equal weight, so it does not equal the exact mean loss over the full batch. The effect on training is probably small, but the result is not exact.

https://github.com/huggingface/pytorch-pretrained-BERT/blob/3fc63f126ddf883ba9659f13ec046c3639db7b7e/examples/run_squad.py#L1006-L1007
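A quick numeric sketch of the bias (the loss values here are made up purely for illustration): with a batch of 13 split 7/6 the way DataParallel chunks inputs across 2 GPUs, the mean of the two per-GPU means is 7.25, while the true mean over all 13 examples is 7.0:

    import torch

    losses = torch.arange(1.0, 14.0)   # 13 hypothetical per-example losses, true mean = 7.0
    gpu0, gpu1 = losses.chunk(2)       # sizes 7 and 6, as DataParallel would split them

    true_mean = losses.mean()                                # 7.0
    biased = torch.stack([gpu0.mean(), gpu1.mean()]).mean()  # (4.0 + 10.5) / 2 = 7.25
    print(true_mean.item(), biased.item())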

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

2 reactions
yaroslavvb commented, May 14, 2019

You could fix it with something like this (assuming `loss` is the per-worker sum of example losses, i.e. reduction='sum', so that summed loss divided by summed batch size gives the exact global mean):

    import torch
    import torch.distributed as dist

    batch_size = torch.tensor(data.shape[1]).to(device)  # local batch size on this worker
    dist.all_reduce(batch_size, op=dist.ReduceOp.SUM)    # total examples across all workers
    dist.all_reduce(loss, op=dist.ReduceOp.SUM)          # total loss across all workers
    mean_loss = loss / batch_size                        # exact mean over the full global batch
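The snippet above assumes a torch.distributed setup. run_squad.py instead uses single-process DataParallel, where `loss` arrives as one mean per replica; a minimal sketch of an equivalent fix (the names `n_gpu` and `input_ids` appear in run_squad.py, but this exact code is not from the issue) is to weight each replica's mean by its share of the batch:

    # `loss` has shape (n_gpu,): one mean loss per replica.
    # DataParallel scatters the batch with chunk(), so sizes can be recovered the same way.
    if n_gpu > 1:
        sizes = [c.size(0) for c in input_ids.chunk(n_gpu)]   # e.g. [7, 6] for a batch of 13
        weights = torch.tensor(sizes, dtype=loss.dtype, device=loss.device)
        loss = (loss * weights).sum() / input_ids.size(0)     # exact mean over all examples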
0 reactions
stale[bot] commented, Jul 14, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.


Top Results From Across the Web

Efficient Training on Multiple GPUs - Hugging Face
When training on a single GPU is too slow or the model weights don't fit in a single GPU's memory we use a...

Run Pytorch on Multiple GPUs - #63 by Brando_Miranda
Hello Just a noobie question on running pytorch on multiple GPU. If I simply specify this: device = torch.device("cuda:0"), this only runs on...

Train With Mixed Precision - NVIDIA Documentation Center
For multi-GPU training, the same strategy applies for loss scaling. NCCL supports both half precision floats and normal floats, therefore, ...

Multi-GPU Training Using PyTorch Lightning - Wandb
In this article, we take a look at how to execute multi-GPU training using PyTorch Lightning and visualize GPU performance in Weights &...

What should I do when my neural network doesn't learn?
There are two features of neural networks that make verification even more ... These bugs might even be the insidious kind for which...
