
When using multiple GPUs, `loss.mean()` may have a subtle bias

See original GitHub issue

The problem is that when the input batch is scattered across multiple GPUs, each GPU may receive a different number of examples.

For example, with 2 GPUs and a total batch size of 13, the per-GPU batch sizes will be 7 and 6. `loss.mean()` then averages the two per-GPU mean losses with equal weight, so it does not equal the exact mean loss over the full batch. The effect on training is probably small, but the result is not exact.

https://github.com/huggingface/pytorch-pretrained-BERT/blob/3fc63f126ddf883ba9659f13ec046c3639db7b7e/examples/run_squad.py#L1006-L1007
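A quick numeric sketch of the bias (the loss values here are made up purely for illustration): with a batch of 13 split 7/6 the way DataParallel chunks inputs across 2 GPUs, the mean of the two per-GPU means is 7.25, while the true mean over all 13 examples is 7.0:

    import torch

    losses = torch.arange(1.0, 14.0)   # 13 hypothetical per-example losses, true mean = 7.0
    gpu0, gpu1 = losses.chunk(2)       # sizes 7 and 6, as DataParallel would split them

    true_mean = losses.mean()                                # 7.0
    biased = torch.stack([gpu0.mean(), gpu1.mean()]).mean()  # (4.0 + 10.5) / 2 = 7.25
    print(true_mean.item(), biased.item())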

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

2 reactions
yaroslavvb commented, May 14, 2019

You could fix it with something like this (assuming `loss` is the per-worker sum of example losses, i.e. reduction='sum', so that summed loss divided by summed batch size gives the exact global mean):

    import torch
    import torch.distributed as dist

    batch_size = torch.tensor(data.shape[1]).to(device)  # local batch size on this worker
    dist.all_reduce(batch_size, op=dist.ReduceOp.SUM)    # total examples across all workers
    dist.all_reduce(loss, op=dist.ReduceOp.SUM)          # total loss across all workers
    mean_loss = loss / batch_size                        # exact mean over the full global batch
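The snippet above assumes a torch.distributed setup. run_squad.py instead uses single-process DataParallel, where `loss` arrives as one mean per replica; a minimal sketch of an equivalent fix (the names `n_gpu` and `input_ids` appear in run_squad.py, but this exact code is not from the issue) is to weight each replica's mean by its share of the batch:

    # `loss` has shape (n_gpu,): one mean loss per replica.
    # DataParallel scatters the batch with chunk(), so sizes can be recovered the same way.
    if n_gpu > 1:
        sizes = [c.size(0) for c in input_ids.chunk(n_gpu)]   # e.g. [7, 6] for a batch of 13
        weights = torch.tensor(sizes, dtype=loss.dtype, device=loss.device)
        loss = (loss * weights).sum() / input_ids.size(0)     # exact mean over all examples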
0 reactions
stale[bot] commented, Jul 14, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.


Top Results From Across the Web

Efficient Training on Multiple GPUs - Hugging Face
When training on a single GPU is too slow or the model weights don't fit in a single GPU's memory we use a...

Run Pytorch on Multiple GPUs - #63 by Brando_Miranda
Hello Just a noobie question on running pytorch on multiple GPU. If I simply specify this: device = torch.device("cuda:0"), this only runs on...

Train With Mixed Precision - NVIDIA Documentation Center
For multi-GPU training, the same strategy applies for loss scaling. NCCL supports both half precision floats and normal floats, therefore, ...

Multi-GPU Training Using PyTorch Lightning - Wandb
In this article, we take a look at how to execute multi-GPU training using PyTorch Lightning and visualize GPU performance in Weights &...

What should I do when my neural network doesn't learn?
There are two features of neural networks that make verification even more ... These bugs might even be the insidious kind for which...
