About multi-gpu loss calculation
Thanks for your nice work! I notice there is a `mean()` applied to the loss when the program runs on multiple GPUs, but there is no gather operation anywhere. In other words, the loss in
https://github.com/microsoft/UniVL/blob/0a7c07f566a3b220731f4abcaa6e1ee59a686596/main_pretrain.py#L332
is a scalar, not a list of tensors. Am I right?
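For context, here is a minimal sketch (a toy model, not the UniVL code) of why a `mean()` with no explicit gather call shows up in training loops written for `torch.nn.DataParallel`: when the loss is computed inside the model's `forward()`, DataParallel gathers the per-replica losses onto the default device automatically and returns them as a 1-D tensor with one entry per GPU.

```python
# Minimal sketch (assumptions: toy model and random data, not the UniVL training code).
# With torch.nn.DataParallel, if the loss is computed inside forward(), the outputs of
# all replicas are gathered automatically, so the returned "loss" is a 1-D tensor with
# one entry per GPU; .mean() reduces it to a scalar before backward().
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(16, 1)

    def forward(self, x, y):
        pred = self.linear(x)
        # Loss computed inside forward, as in many multi-task training codes.
        return nn.functional.mse_loss(pred, y)

model = ToyModel()
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # replicate the module across all visible GPUs
model = model.to("cuda" if torch.cuda.is_available() else "cpu")

device = next(model.parameters()).device
x = torch.randn(32, 16, device=device)
y = torch.randn(32, 1, device=device)

loss = model(x, y)
# Under DataParallel, loss has shape (num_gpus,); on a single GPU/CPU it is a 0-d scalar.
if loss.dim() > 0:
    loss = loss.mean()
loss.backward()
```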
Issue Analytics
- Created: 2 years ago
- Comments: 10 (5 by maintainers)
Top Results From Across the Web
PyTorch Multi GPU: 3 Techniques Explained - Run:AI
There are three main ways to use PyTorch with multiple GPUs. These are: ... averages GPU-losses and performs a backward pass loss.mean().backward()
Read more >

Efficient Training on Multiple GPUs - Hugging Face
To calculate the global batch size of the DP + PP setup we then do: mbs*chunks*dp_degree ( 8*32*4=1024 ). Let's go back to...
Read more >

13.5. Training on Multiple GPUs - Dive into Deep Learning
Multiple GPUs, after all, increase both memory and computation ability. ... Each GPU calculates loss and gradient of the model parameters based on...
Read more >

When calculate loss in model forward with multi-gpu training ...
Hi everyone, when I use F.nn_loss() in model forward as above. Then I two GPUs to train the model in form of model...
Read more >

How to scale training on multiple GPUs - Towards Data Science
The loss function is calculated, comparing the predicted label with the ground-truth label; The backward pass is done, calculating the gradients ...
Read more >
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
For your reference: 0.13 -> 0.02 and 0.12 -> 0.09 at the two stages. These figures are not exact because the logs were corrupted by machine problems. Once again, convergence is what matters most.
Hi @forence, you are right. I confused `torch.nn.DataParallel` with `torch.nn.parallel.DistributedDataParallel`. Thank you for pointing it out. The `mean()` is indeed redundant in our code. Thanks.
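For readers hitting the same DataParallel vs. DistributedDataParallel confusion, here is a minimal DDP sketch (a toy model and a `torchrun` launch, not the UniVL training script) illustrating why the `mean()` becomes redundant in that setting: each process already holds a scalar loss, no losses are gathered across processes, and gradient averaging happens via all-reduce inside `backward()`.

```python
# Minimal sketch of the DistributedDataParallel (DDP) side of the comparison
# (assumptions: toy model, single-node launch via torchrun; not the UniVL code).
# Each process computes its own scalar loss; DDP never gathers losses across
# processes -- gradients are averaged by all-reduce during backward(), so a
# .mean() on the already-scalar loss is a no-op.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl" if torch.cuda.is_available() else "gloo")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    device = torch.device(f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu")

    model = nn.Linear(16, 1).to(device)
    model = DDP(model, device_ids=[local_rank] if torch.cuda.is_available() else None)

    x = torch.randn(32, 16, device=device)
    y = torch.randn(32, 1, device=device)

    loss = nn.functional.mse_loss(model(x), y)  # 0-d scalar on this process
    loss.mean().backward()  # .mean() changes nothing here; grads are all-reduced by DDP

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with one process per GPU, e.g. `torchrun --nproc_per_node=2 ddp_sketch.py` (file name is illustrative), each rank sees only its own scalar loss, which is why the gather-then-average step that DataParallel needs does not appear here.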