`test` produces a warning when using `DDP`

See original GitHub issue

Trying to run `trainer.test` with multiple GPUs (or even with a single GPU using `DDPStrategy`) produces the following warning:

PossibleUserWarning: Using `DistributedSampler` with the dataloaders. During `trainer.test()`,
it is recommended to use `Trainer(devices=1)` to ensure each sample/batch gets evaluated
exactly once. Otherwise, multi-device settings use `DistributedSampler` that replicates some
samples to make sure all devices have same batch size in case of uneven inputs.
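For reference, a minimal sketch of a setup that triggers the warning; the model and datamodule names are placeholders for your own code:

```python
import pytorch_lightning as pl
from pytorch_lightning.strategies import DDPStrategy

# Placeholders for your own LightningModule / LightningDataModule.
model = MyLitModel()
dm = MyDataModule()

# Either of these configurations uses a DistributedSampler and emits the
# PossibleUserWarning once trainer.test() is called:
trainer = pl.Trainer(accelerator="gpu", devices=2, strategy="ddp")
# trainer = pl.Trainer(accelerator="gpu", devices=1, strategy=DDPStrategy())

trainer.test(model, datamodule=dm)
```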

The problem is that the warning doesn't adequately explain how to fix this in all possible cases.

1. What if I am running trainer.test after trainer.fit?

Setting devices=1 in that case is not really a solution, because I want to use multiple GPUs for training. Creating a new Trainer instance also doesn't quite work, because that would create a separate experiment (AFAIK?). For example, ckpt_path="best" wouldn't work with a new Trainer instance, the TensorBoard logs will get segmented, and so on.

Is it possible to use a different Strategy for tune, fit and test in a single Trainer? (btw, this might be useful even outside of this issue, as tune currently doesn’t work well with DDP)
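For what it's worth, a minimal sketch of the "second Trainer" workaround that question 1 is poking at, assuming the best checkpoint path can be read from the fit Trainer's checkpoint callback and that reusing the same logger instance keeps the runs together (which is exactly the part the question doubts):

```python
import pytorch_lightning as pl

# Placeholders for your own LightningModule / LightningDataModule.
model = MyLitModel()
dm = MyDataModule()

# Multi-GPU training with DDP.
trainer = pl.Trainer(accelerator="gpu", devices=4, strategy="ddp", max_epochs=10)
trainer.fit(model, datamodule=dm)

# ckpt_path="best" only resolves inside the Trainer that tracked the checkpoint,
# so read the path explicitly before handing it to a second, single-device Trainer.
best_ckpt = trainer.checkpoint_callback.best_model_path

if trainer.is_global_zero:
    # devices=1 means a plain sequential sampler: every test sample is seen exactly once.
    # Reusing the fit Trainer's logger is an attempt to avoid segmented logs; whether
    # that counts as "the same experiment" is the open question above.
    test_trainer = pl.Trainer(accelerator="gpu", devices=1, logger=trainer.logger)
    test_trainer.test(model, datamodule=dm, ckpt_path=best_ckpt)
```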

2. What if I don’t care about DistributedSampler adding extra samples?

Please correct me if I am wrong, but DistributedSampler should add at most num_devices - 1 extra samples. This means that unless you are using hundreds of devices or an extremely small dataset, the difference in metrics will probably be:

a) less than the rounding precision, or
b) less than the natural fluctuations due to random initialization and non-deterministic CUDA shenanigans.

I think that bothering users with such a minor issue isn’t really desirable. Can this warning be silenced somehow?
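Until there is an official knob for it, a minimal sketch of silencing just this warning with Python's standard warnings filters, assuming PossibleUserWarning is importable from pytorch_lightning.utilities.warnings as in the 1.6/1.7 releases this issue was filed against:

```python
import warnings

from pytorch_lightning.utilities.warnings import PossibleUserWarning

# Silence only this particular message; other PossibleUserWarnings still show.
warnings.filterwarnings(
    "ignore",
    message=".*it is recommended to use.*devices=1.*",
    category=PossibleUserWarning,
)

# Or, more bluntly, silence the whole category:
# warnings.filterwarnings("ignore", category=PossibleUserWarning)
```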

3. Can this be fixed without requiring any changes from the users?

I found pytorch_lightning.overrides.distributed.UnrepeatedDistributedSampler, which allegedly solves this exact problem, but doesn’t work for training.

Does UnrepeatedDistributedSampler solve this issue? If it does, I think it should at least be mentioned in the warning and, at best, used automatically during test instead of warning the user.
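For illustration, a rough sketch of how one might plug it in manually today, assuming UnrepeatedDistributedSampler accepts the same constructor arguments as torch's DistributedSampler and using the 1.6-era replace_sampler_ddp flag to stop Lightning from injecting its own sampler (self.test_dataset is a hypothetical attribute):

```python
import pytorch_lightning as pl
from torch.utils.data import DataLoader

from pytorch_lightning.overrides.distributed import UnrepeatedDistributedSampler


class MyLitModel(pl.LightningModule):  # sketch; training hooks omitted
    def test_dataloader(self):
        # Assigns each sample to exactly one rank instead of padding ranks to
        # equal length, so nothing is evaluated twice. Per its docstring it must
        # not be used for training, since uneven batch counts can deadlock DDP.
        sampler = UnrepeatedDistributedSampler(self.test_dataset, shuffle=False)
        return DataLoader(self.test_dataset, batch_size=32, sampler=sampler)


# Tell Lightning not to swap in its own DistributedSampler.
trainer = pl.Trainer(
    accelerator="gpu", devices=4, strategy="ddp", replace_sampler_ddp=False
)
```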

cc @justusschock @kaushikb11 @awaelchli @akihironitta @rohitgr7

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 10 (4 by maintainers)

Top GitHub Comments

3 reactions
awaelchli commented, Aug 4, 2022

Hi @sounakdey, it is on our roadmap. The issue where this was discussed is here: https://github.com/Lightning-AI/lightning/issues/3325. If we get support for uneven inputs, plus use torchmetrics to compute metrics, trainer.test will be gold and should produce exactly the same output regardless of how many GPUs are used to run it.
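(For context, a minimal sketch of the torchmetrics side of that, using the current torchmetrics API: the metric keeps its state per process and syncs across ranks when it is computed, so the aggregate no longer depends on how the sampler sharded the batches; the duplicated-samples part still needs the uneven-inputs support mentioned above.)

```python
import torch
import torchmetrics
import pytorch_lightning as pl


class LitClassifier(pl.LightningModule):  # minimal sketch; training hooks omitted
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.backbone = torch.nn.Linear(32, num_classes)  # toy stand-in model
        self.test_acc = torchmetrics.Accuracy(task="multiclass", num_classes=num_classes)

    def forward(self, x):
        return self.backbone(x)

    def test_step(self, batch, batch_idx):
        x, y = batch
        preds = self(x).argmax(dim=-1)
        # The metric accumulates state on each rank and performs a cross-rank sync
        # when Lightning computes it at the end of the test epoch.
        self.test_acc(preds, y)
        self.log("test_acc", self.test_acc)
```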

2 reactions
sounakdey commented, Aug 4, 2022

Just commenting because this is really important for the research community using PyTorch Lightning… to make sure the results are reproducible… Are we still looking into this?

Read more comments on GitHub.

Top Results From Across the Web

  • test produces a warning when using DDP #12862 - GitHub
  • GPU training (Intermediate) - PyTorch Lightning - Read the Docs
  • PyTorch DDP: Finding the cause of "Expected to mark a ..."
  • Using Hydra + DDP - PyTorch Lightning
  • Distributed Data Parallel - PyTorch 1.13 documentation
