How to use all_gather in a training loop?
I have defined my train_step in the exact same way as in the cifar10 example. Is it possible to gather all of the predictions before computing the loss? I haven't seen examples of this pattern in the ignite examples (maybe I'm missing it?), but for my application it is better to compute the loss after aggregating the forward passes and targets run on multiple GPUs. This only matters when using DistributedDataParallel, since DataParallel automatically aggregates the outputs.

I see the idist.all_gather() function, but I am unclear how to use it in a training loop.
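For context, my understanding of the basic mechanics of idist.all_gather is roughly the following (a standalone sketch, not my actual training code; it assumes a GPU/NCCL setup launched with torchrun, otherwise backend="gloo" would be needed):

```python
import torch
import ignite.distributed as idist

def run(local_rank):
    # each process creates a small tensor filled with its own rank
    preds = torch.full((4,), float(idist.get_rank()), device=idist.device())
    # all_gather concatenates the per-process tensors along dim 0,
    # and every process receives the same concatenated result
    all_preds = idist.all_gather(preds)   # shape: (4 * world_size,)
    print(idist.get_rank(), all_preds.tolist())

# launched e.g. with `torchrun --nproc_per_node=2 this_script.py`;
# use backend="gloo" if no GPUs / NCCL are available
with idist.Parallel(backend="nccl") as parallel:
    parallel.run(run)
```

What I don't see is how to plug this into a train_step so that the gathered predictions can still be used to compute a loss and backpropagate.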
Ok, I understand. You should have a look at a distributed implementation of SimCLR. See, for instance:
https://github.com/Spijkervet/SimCLR/blob/cd85c4366d2e6ac1b0a16798b76ac0a2c8a94e58/simclr/modules/nt_xent.py#L7
This might give you some inspiration.
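The relevant piece in that file is an autograd-aware gather. Roughly (a sketch of the pattern, not a verbatim copy of the linked code), it looks like this:

```python
import torch
import torch.distributed as dist

class GatherLayer(torch.autograd.Function):
    """all_gather that participates in autograd: forward gathers the tensor
    from every rank, backward returns the gradient slice owned by this rank."""

    @staticmethod
    def forward(ctx, x):
        gathered = [torch.zeros_like(x) for _ in range(dist.get_world_size())]
        dist.all_gather(gathered, x)   # the gathered copies carry no autograd graph
        return tuple(gathered)

    @staticmethod
    def backward(ctx, *grads):
        # only the gradient slice corresponding to this rank's input flows back
        return grads[dist.get_rank()].contiguous()

# usage: z_all = torch.cat(GatherLayer.apply(z), dim=0)
```

Some implementations instead sum the incoming gradients across ranks in backward, and, if I'm not mistaken, recent PyTorch also ships an autograd-aware all_gather in torch.distributed.nn; which variant is appropriate depends on whether every rank computes the full loss.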
Hi @vfdev-5, sure.

We are using the Supervised Contrastive loss to train an embedding. In Eq. 2 of the paper, the loss depends on the number of samples (positive and negative) used to compute it. My colleague suggested that it is better to compute the loss over the entire global batch, rather than over the batch/ngpu samples each process sees (which is what happens when using DDP and computing the loss locally on each GPU). This is because the denominator of SupConLoss sums over the negative samples, so aggregating all of the negatives across GPUs first gives a more accurate loss.
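To make this concrete, the train_step I have in mind would look roughly like the sketch below. Here encoder, optimizer and supcon_loss are placeholders for my model, optimizer and whatever SupConLoss implementation ends up being used, and GatherLayer refers to the sketch above:

```python
import torch
import torch.nn.functional as F
import ignite.distributed as idist

def train_step(engine, batch):
    encoder.train()
    x, y = batch[0].to(idist.device()), batch[1].to(idist.device())
    z = F.normalize(encoder(x), dim=1)   # local embeddings

    if idist.get_world_size() > 1:
        # gather embeddings with gradients (GatherLayer) and labels without
        z_all = torch.cat(GatherLayer.apply(z), dim=0)
        y_all = idist.all_gather(y)
    else:
        z_all, y_all = z, y

    # the denominator of the SupCon loss now sums over negatives from the
    # whole global batch, not just the batch/ngpu samples on this GPU
    loss = supcon_loss(z_all, y_all)   # placeholder SupConLoss callable

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```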