`test` produces a warning when using `DDP`

See original GitHub issue

Trying to run `trainer.test` with multiple GPUs (or even with a single GPU using `DDPStrategy`) produces the following warning:

PossibleUserWarning: Using `DistributedSampler` with the dataloaders. During `trainer.test()`,
it is recommended to use `Trainer(devices=1)` to ensure each sample/batch gets evaluated
exactly once. Otherwise, multi-device settings use `DistributedSampler` that replicates some
samples to make sure all devices have same batch size in case of uneven inputs.
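For reference, a minimal sketch of a setup that triggers the warning; the model and datamodule names are placeholders for your own code:

```python
import pytorch_lightning as pl
from pytorch_lightning.strategies import DDPStrategy

# Placeholders for your own LightningModule / LightningDataModule.
model = MyLitModel()
dm = MyDataModule()

# Either of these configurations uses a DistributedSampler and emits the
# PossibleUserWarning once trainer.test() is called:
trainer = pl.Trainer(accelerator="gpu", devices=2, strategy="ddp")
# trainer = pl.Trainer(accelerator="gpu", devices=1, strategy=DDPStrategy())

trainer.test(model, datamodule=dm)
```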

The problem is that the warning doesn't adequately explain how to fix this in all possible cases.

1. What if I am running trainer.test after trainer.fit?

Setting devices=1 in that case is not really a solution, because I want to use multiple GPUs for training. Creating a new Trainer instance also doesn't quite work, because that would create a separate experiment (AFAIK?). For example, ckpt_path="best" wouldn't work with a new Trainer instance, the TensorBoard logs will get segmented, and so on.

Is it possible to use a different Strategy for tune, fit and test in a single Trainer? (btw, this might be useful even outside of this issue, as tune currently doesn’t work well with DDP)
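For what it's worth, a minimal sketch of the "second Trainer" workaround that question 1 is poking at, assuming the best checkpoint path can be read from the fit Trainer's checkpoint callback and that reusing the same logger instance keeps the runs together (which is exactly the part the question doubts):

```python
import pytorch_lightning as pl

# Placeholders for your own LightningModule / LightningDataModule.
model = MyLitModel()
dm = MyDataModule()

# Multi-GPU training with DDP.
trainer = pl.Trainer(accelerator="gpu", devices=4, strategy="ddp", max_epochs=10)
trainer.fit(model, datamodule=dm)

# ckpt_path="best" only resolves inside the Trainer that tracked the checkpoint,
# so read the path explicitly before handing it to a second, single-device Trainer.
best_ckpt = trainer.checkpoint_callback.best_model_path

if trainer.is_global_zero:
    # devices=1 means a plain sequential sampler: every test sample is seen exactly once.
    # Reusing the fit Trainer's logger is an attempt to avoid segmented logs; whether
    # that counts as "the same experiment" is the open question above.
    test_trainer = pl.Trainer(accelerator="gpu", devices=1, logger=trainer.logger)
    test_trainer.test(model, datamodule=dm, ckpt_path=best_ckpt)
```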

2. What if I don’t care about DistributedSampler adding extra samples?

Please correct me if I am wrong, but DistributedSampler should add at most num_devices - 1 extra samples. This means that unless you are using hundreds of devices or an extremely small dataset, the difference in metrics will probably be:

a) less than the rounding precision, or
b) less than the natural fluctuations due to random initialization and non-deterministic CUDA shenanigans.

I think that bothering users with such a minor issue isn’t really desirable. Can this warning be silenced somehow?
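Until there is an official knob for it, a minimal sketch of silencing just this warning with Python's standard warnings filters, assuming PossibleUserWarning is importable from pytorch_lightning.utilities.warnings as in the 1.6/1.7 releases this issue was filed against:

```python
import warnings

from pytorch_lightning.utilities.warnings import PossibleUserWarning

# Silence only this particular message; other PossibleUserWarnings still show.
warnings.filterwarnings(
    "ignore",
    message=".*it is recommended to use.*devices=1.*",
    category=PossibleUserWarning,
)

# Or, more bluntly, silence the whole category:
# warnings.filterwarnings("ignore", category=PossibleUserWarning)
```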

3. Can this be fixed without requiring any changes from the users?

I found pytorch_lightning.overrides.distributed.UnrepeatedDistributedSampler, which allegedly solves this exact problem, but doesn’t work for training.

Does UnrepeatedDistributedSampler solve this issue? If it does, I think it should at least be mentioned in the warning and, at best, used automatically during test instead of warning the user.
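For illustration, a rough sketch of how one might plug it in manually today, assuming UnrepeatedDistributedSampler accepts the same constructor arguments as torch's DistributedSampler and using the 1.6-era replace_sampler_ddp flag to stop Lightning from injecting its own sampler (self.test_dataset is a hypothetical attribute):

```python
import pytorch_lightning as pl
from torch.utils.data import DataLoader

from pytorch_lightning.overrides.distributed import UnrepeatedDistributedSampler


class MyLitModel(pl.LightningModule):  # sketch; training hooks omitted
    def test_dataloader(self):
        # Assigns each sample to exactly one rank instead of padding ranks to
        # equal length, so nothing is evaluated twice. Per its docstring it must
        # not be used for training, since uneven batch counts can deadlock DDP.
        sampler = UnrepeatedDistributedSampler(self.test_dataset, shuffle=False)
        return DataLoader(self.test_dataset, batch_size=32, sampler=sampler)


# Tell Lightning not to swap in its own DistributedSampler.
trainer = pl.Trainer(
    accelerator="gpu", devices=4, strategy="ddp", replace_sampler_ddp=False
)
```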

cc @justusschock @kaushikb11 @awaelchli @akihironitta @rohitgr7

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 10 (4 by maintainers)

Top GitHub Comments

3 reactions
awaelchli commented, Aug 4, 2022

Hi @sounakdey, it is on our roadmap. The issue where this was discussed is here: https://github.com/Lightning-AI/lightning/issues/3325. If we get support for uneven inputs, plus use torchmetrics to compute metrics, trainer.test will be gold and should produce exactly the same output regardless of how many GPUs are used to run it.
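(For context, a minimal sketch of the torchmetrics side of that, using the current torchmetrics API: the metric keeps its state per process and syncs across ranks when it is computed, so the aggregate no longer depends on how the sampler sharded the batches; the duplicated-samples part still needs the uneven-inputs support mentioned above.)

```python
import torch
import torchmetrics
import pytorch_lightning as pl


class LitClassifier(pl.LightningModule):  # minimal sketch; training hooks omitted
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.backbone = torch.nn.Linear(32, num_classes)  # toy stand-in model
        self.test_acc = torchmetrics.Accuracy(task="multiclass", num_classes=num_classes)

    def forward(self, x):
        return self.backbone(x)

    def test_step(self, batch, batch_idx):
        x, y = batch
        preds = self(x).argmax(dim=-1)
        # The metric accumulates state on each rank and performs a cross-rank sync
        # when Lightning computes it at the end of the test epoch.
        self.test_acc(preds, y)
        self.log("test_acc", self.test_acc)
```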

2 reactions
sounakdey commented, Aug 4, 2022

Just commenting because this is really important for the research community using PyTorch Lightning… to make sure the results are reproducible… Are we still looking into this?

Read more comments on GitHub.

Top Results From Across the Web

  • test produces a warning when using DDP #12862 - GitHub
  • GPU training (Intermediate) - PyTorch Lightning - Read the Docs
  • PyTorch DDP: Finding the cause of "Expected to mark a ..."
  • Using Hydra + DDP - PyTorch Lightning
  • Distributed Data Parallel - PyTorch 1.13 documentation
