When using multiple GPUs, `loss.mean()` may have a subtle bias
See original GitHub issue
The problem is that when the input batch is distributed across multiple GPUs, each GPU may receive a different batch_size.
For example, with 2 GPUs and a total batch_size of 13, the per-GPU batch sizes will be 7 and 6 respectively. Each GPU computes the mean loss over its own portion, and `loss.mean()` then averages those per-GPU means with equal weight, so it does not give the exact mean loss over the full batch. This may have little influence on training, but the reported loss is not the exact result.
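A small, self-contained illustration of the bias, using made-up per-sample losses and the 7/6 split from the example above:

```python
import torch

# Fabricated per-sample losses for a global batch of 13, split 7 / 6 across two GPUs.
per_sample = torch.arange(1.0, 14.0)     # 1.0, 2.0, ..., 13.0
gpu0, gpu1 = per_sample[:7], per_sample[7:]

exact = per_sample.mean()                                # 7.0, mean over all 13 samples
biased = torch.stack([gpu0.mean(), gpu1.mean()]).mean()  # (4.0 + 10.5) / 2 = 7.25
print(exact.item(), biased.item())
# 7.0 vs. 7.25: each per-GPU mean gets equal weight even though
# one GPU saw 7 samples and the other only 6.
```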
Issue Analytics
- State:
- Created: 4 years ago
- Comments: 6 (3 by maintainers)
Top Results From Across the Web

Efficient Training on Multiple GPUs - Hugging Face
When training on a single GPU is too slow or the model weights don't fit in a single GPU's memory, we use a...

Run Pytorch on Multiple GPUs - #63 by Brando_Miranda
Hello. Just a noobie question on running pytorch on multiple GPU. If I simply specify this: device = torch.device("cuda:0"), this only runs on...

Train With Mixed Precision - NVIDIA Documentation Center
For multi-GPU training, the same strategy applies for loss scaling. NCCL supports both half precision floats and normal floats, therefore, ...

Multi-GPU Training Using PyTorch Lightning - Wandb
In this article, we take a look at how to execute multi-GPU training using PyTorch Lightning and visualize GPU performance in Weights &...

What should I do when my neural network doesn't learn?
There are two features of neural networks that make verification even more ... These bugs might even be the insidious kind for which...
You could fix it with something like this:
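The code that originally followed this comment is not preserved in this copy of the thread. A minimal sketch of one possible fix, assuming the loss is computed inside the `nn.DataParallel`-wrapped module (the `ModelWithLoss` wrapper and `backbone` name below are hypothetical), is to return the per-GPU loss sum together with the per-GPU sample count, and divide by the true global batch size at the end:

```python
import torch
import torch.nn as nn

class ModelWithLoss(nn.Module):
    """Hypothetical wrapper: computing the loss inside forward() is the
    pattern where the uneven-split bias shows up under nn.DataParallel."""
    def __init__(self, backbone):
        super().__init__()
        self.backbone = backbone
        # reduction='sum' so each GPU returns a loss sum, not a per-GPU mean
        self.criterion = nn.CrossEntropyLoss(reduction='sum')

    def forward(self, inputs, targets):
        logits = self.backbone(inputs)
        loss_sum = self.criterion(logits, targets)
        count = torch.tensor([inputs.size(0)], device=inputs.device)
        # shape [1] per replica, so DataParallel gathers one entry per GPU
        return loss_sum.unsqueeze(0), count

# model = nn.DataParallel(ModelWithLoss(backbone)).cuda()
# loss_sums, counts = model(inputs, targets)   # one entry per GPU
# loss = loss_sums.sum() / counts.sum()        # exact mean over the full batch of 13
```

Weighting the gathered per-GPU means by their batch sizes would give the same result; the point in either case is that the final division uses the total number of samples rather than the number of GPUs.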
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.