`test` produces a warning when using `DDP`
Trying to run `trainer.test` with multiple GPUs (or even with a single GPU when using `DDPStrategy`) produces the following warning:
> PossibleUserWarning: Using `DistributedSampler` with the dataloaders. During `trainer.test()`,
> it is recommended to use `Trainer(devices=1)` to ensure each sample/batch gets evaluated
> exactly once. Otherwise, multi-device settings use `DistributedSampler` that replicates some
> samples to make sure all devices have same batch size in case of uneven inputs.
The problem is that the warning doesn't adequately explain how to fix this in every possible case.
1. What if I am running `trainer.test` after `trainer.fit`?

Setting `devices=1` in that case is not really a solution, because I want to use multiple GPUs for training. Creating a new `Trainer` instance also doesn't quite work, because that would create a separate experiment (AFAIK?). For example, `ckpt_path="best"` wouldn't work with a new `Trainer` instance, the TensorBoard logs would get segmented, and so on. (A rough sketch of this workaround follows below.)
Is it possible to use a different `Strategy` for `tune`, `fit`, and `test` in a single `Trainer`? (Btw, this might be useful even outside of this issue, as `tune` currently doesn't work well with `DDP`.)
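For concreteness, here is a minimal sketch of the workaround I mean, assuming a hypothetical `MyModel` / `MyDataModule` pair: train with DDP, then test on a single device by passing the best checkpoint path explicitly and reusing the same logger so the runs aren't split. With the `ddp` strategy every rank re-runs the script, hence the rank-zero guard; I am not sure this counts as an officially supported pattern:

```python
import pytorch_lightning as pl
from pytorch_lightning.loggers import TensorBoardLogger

model = MyModel()            # hypothetical LightningModule
datamodule = MyDataModule()  # hypothetical LightningDataModule

logger = TensorBoardLogger("lightning_logs", name="exp")

# multi-GPU training
trainer = pl.Trainer(accelerator="gpu", devices=4, strategy="ddp", logger=logger)
trainer.fit(model, datamodule=datamodule)

# single-device testing so that every sample is evaluated exactly once;
# only rank 0 runs it, and the best checkpoint path is passed explicitly
# because `ckpt_path="best"` would refer to the new trainer, not the one that trained
if trainer.is_global_zero:
    best_ckpt = trainer.checkpoint_callback.best_model_path
    test_trainer = pl.Trainer(accelerator="gpu", devices=1, logger=logger)
    test_trainer.test(model, datamodule=datamodule, ckpt_path=best_ckpt)
```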
2. What if I don't care about `DistributedSampler` adding extra samples?
Please correct me if I am wrong, but `DistributedSampler` should add at most `num_devices - 1` extra samples. This means that unless you are using hundreds of devices or an extremely small dataset, the difference in metrics will probably be a) smaller than the rounding precision and b) smaller than the natural fluctuations due to random initialization and non-deterministic CUDA shenanigans.

I think that bothering users with such a minor issue isn't really desirable. Can this warning be silenced somehow?
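For reference, the standard `warnings` filter does seem to do the job, though I am not sure it is the intended way; this assumes `PossibleUserWarning` can be imported from `pytorch_lightning.utilities.warnings` (the path may differ between versions) and that the message still matches the regex below:

```python
import warnings

from pytorch_lightning.utilities.warnings import PossibleUserWarning

# silence only the DistributedSampler-during-test notice, not every PossibleUserWarning
warnings.filterwarnings(
    "ignore",
    message=r".*Using `DistributedSampler` with the dataloaders.*",
    category=PossibleUserWarning,
)
```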
3. Can this be fixed without requiring any changes from the users?
I found `pytorch_lightning.overrides.distributed.UnrepeatedDistributedSampler`, which allegedly solves this exact problem but doesn't work for training. Does `UnrepeatedDistributedSampler` solve this issue? If it does, I think it should at least be mentioned in the warning, and ideally be used automatically during `test` instead of warning the user.
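In case it helps the discussion, this is roughly how I imagine wiring it up by hand: build the test dataloader inside the `test_dataloader` hook, so the process group already exists when the sampler asks for its rank, and turn off Lightning's own sampler injection. The toy module is made up, and I am assuming the flag is still called `replace_sampler_ddp` (it may have been renamed `use_distributed_sampler` in newer releases):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

import pytorch_lightning as pl
from pytorch_lightning.overrides.distributed import UnrepeatedDistributedSampler


class ToyModel(pl.LightningModule):
    """Minimal module whose test dataloader shards the data without padding."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(8, 1)
        # deliberately not divisible by the number of devices
        self.test_dataset = TensorDataset(torch.randn(103, 8))

    def test_step(self, batch, batch_idx):
        (x,) = batch
        self.log("test_mean", self.layer(x).mean())

    def test_dataloader(self):
        # this hook runs after DDP is set up, so the sampler can query the process
        # group for num_replicas/rank on its own; no sample gets repeated
        sampler = UnrepeatedDistributedSampler(self.test_dataset, shuffle=False)
        return DataLoader(self.test_dataset, batch_size=16, sampler=sampler)


if __name__ == "__main__":
    pl.seed_everything(0)  # identical toy data on every rank
    trainer = pl.Trainer(
        accelerator="gpu",
        devices=4,
        strategy="ddp",
        replace_sampler_ddp=False,  # keep Lightning from injecting the padded sampler
    )
    trainer.test(ToyModel())
```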
cc @justusschock @kaushikb11 @awaelchli @akihironitta @rohitgr7
Hi @sounakdey, it is on our roadmap. The issue where this was discussed is here: https://github.com/Lightning-AI/lightning/issues/3325. If we get support for uneven inputs plus using torchmetrics to compute metrics, `trainer.test` will be gold and should produce the exact same output regardless of how many GPUs are used to run it.

Just commenting because this is really important for the research community using PyTorch Lightning, to make sure the results are reproducible. Are we still looking into this?
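For anyone following along, the torchmetrics half of that roadmap already goes a long way: a `Metric` object keeps per-rank state and syncs it across processes when the value is computed, so the duplicated samples from the padded sampler remain the main source of discrepancy. A minimal sketch with a made-up module (the `task` argument assumes a recent torchmetrics release):

```python
import torch
import torchmetrics
import pytorch_lightning as pl


class ToyClassifier(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(8, 1)
        # metric state accumulates per rank and is reduced across processes on compute
        self.test_acc = torchmetrics.Accuracy(task="binary")

    def test_step(self, batch, batch_idx):
        x, y = batch  # y assumed to be 0/1 integer labels
        preds = torch.sigmoid(self.net(x)).squeeze(-1)
        self.test_acc.update(preds, y)
        # logging the Metric object itself lets Lightning handle the cross-rank sync
        self.log("test_acc", self.test_acc)
```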