
Inconsistent DDP unused parameters behavior


🐛 Bug

Hi,

I don't know whether this error should be posted here or in PyTorch. I'm getting the following error with strategy=ddp_find_unused_parameters_false:

    reducer._rebuild_buckets()  # avoids "INTERNAL ASSERT FAILED" with `find_unused_parameters=False`
    RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by making sure all `forward` function outputs participate in calculating loss.
    If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).

Basically, the whole generator is being ignored by the optimizer step. The strange part is that when I use strategy=ddp I get:

    [W reducer.cpp:1289] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag ...

So I don't know which of the two behaviors is correct.
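
For context, the two strategy strings differ only in DDP's find_unused_parameters flag, and the warning above shows that plain strategy="ddp" in Lightning 1.6.x enables it, which is why that run only warns instead of crashing. A minimal sketch of the two configurations being compared (the accelerator/devices arguments are illustrative):

    from pytorch_lightning import Trainer
    from pytorch_lightning.strategies import DDPStrategy

    # Equivalent to strategy="ddp_find_unused_parameters_false": DDP assumes
    # every parameter receives a gradient in each backward pass.
    trainer = Trainer(
        accelerator="gpu",
        devices=2,
        strategy=DDPStrategy(find_unused_parameters=False),
    )

    # Plain strategy="ddp" keeps find_unused_parameters=True, so DDP scans the
    # autograd graph every iteration and merely warns if nothing was unused.
    trainer = Trainer(accelerator="gpu", devices=2, strategy="ddp")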

I'm using manual optimization in my training step as follows:

    def training_step(self, batch, batch_idx):
        x_mel, y_audio, y_mel, _, _ = batch

        y_audio = y_audio.unsqueeze(1)

        y_hat_audio = self.generator(x_mel)
        y_hat_mel = self.trainer.mel_spec.to(self.device)(
            y_hat_audio.squeeze(1), loss=True
        )

        # Optimize
        gen_opt, dis_opt = self.optimizers()

        ## Discriminator optimization step (y_hat_audio is detached, so this
        ## backward produces no gradients for the generator)
        dis_loss = self.discriminator_loss(
            self.mpd(y_audio, y_hat_audio.detach()),
            self.msd(y_audio, y_hat_audio.detach()),
        )

        dis_opt.zero_grad()
        self.manual_backward(dis_loss)
        if self.trainer.clip_grad_val:
            clip_grad_norm_(self.mpd.parameters(), self.trainer.clip_grad_val)
            clip_grad_norm_(self.msd.parameters(), self.trainer.clip_grad_val)
        dis_opt.step()

        ## Generator optimization step
        gen_loss, mel_error = self.generator_loss(
            y_mel,
            y_hat_mel,
            self.mpd(y_audio, y_hat_audio),
            self.msd(y_audio, y_hat_audio),
        )

        gen_opt.zero_grad()
        self.manual_backward(gen_loss)
        if self.trainer.clip_grad_val:
            clip_grad_norm_(self.generator.parameters(), self.trainer.clip_grad_val)
        gen_opt.step()

        self.log_metric(dis_loss, gen_loss, mel_error, DatasetsTypes.TRAIN)

        if self.trainer.is_last_batch:
            for lrs in self.trainer.lr_schedulers_configs:
                lrs.scheduler.step()
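
For completeness, this training_step relies on Lightning's manual optimization mode; a minimal sketch of the setup it assumes (module and optimizer choices are placeholders, only the automatic_optimization flag and the configure_optimizers hook are the Lightning API):

    import torch
    import pytorch_lightning as pl

    class GAN(pl.LightningModule):
        def __init__(self, generator, mpd, msd):
            super().__init__()
            # Opt out of Lightning's automatic loop so that self.optimizers()
            # and self.manual_backward() in training_step take effect.
            self.automatic_optimization = False
            self.generator = generator
            self.mpd = mpd
            self.msd = msd

        def configure_optimizers(self):
            gen_opt = torch.optim.AdamW(self.generator.parameters(), lr=2e-4)
            dis_opt = torch.optim.AdamW(
                list(self.mpd.parameters()) + list(self.msd.parameters()), lr=2e-4
            )
            return gen_opt, dis_opt

Note that even with manual optimization, the whole LightningModule is wrapped in a single DistributedDataParallel instance, so DDP sees one forward pass (training_step) followed by two partial backward passes.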

Expected behavior

DDP should behave consistently across strategies, and it should be clear whether the generator's parameters are actually being optimized.

Environment

* CUDA:
        - GPU:
                - NVIDIA GeForce RTX 2080 Ti
                - NVIDIA GeForce RTX 2080 Ti
                - NVIDIA GeForce RTX 2080 Ti
                - NVIDIA GeForce RTX 2080 Ti
                - NVIDIA GeForce RTX 2080 Ti
                - NVIDIA GeForce RTX 2080 Ti
                - NVIDIA GeForce RTX 2080 Ti
                - NVIDIA GeForce RTX 2080 Ti
                - NVIDIA GeForce RTX 2080 Ti
                - NVIDIA GeForce RTX 2080 Ti
        - available:         True
        - version:           10.2
* Packages:
        - numpy:             1.23.0
        - pyTorch_debug:     False
        - pyTorch_version:   1.11.0+cu102
        - pytorch-lightning: 1.6.4
        - tqdm:              4.64.0
* System:
        - OS:                Linux
        - architecture:
                - 64bit
                - ELF
        - processor:         x86_64
        - python:            3.8.10
        - version:           #137-Ubuntu SMP Wed Jun 15 13:33:07 UTC 2022

cc @justusschock @kaushikb11 @awaelchli @akihironitta @rohitgr7

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

3 reactions
alealv commented, Nov 1, 2022

Hi, I finally found the problem. I'm hitting this error: https://github.com/pytorch/pytorch/issues/61470

I'm closing this issue.
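
For anyone hitting the same wall: without restating the linked issue, the symptom here lines up with the detach pattern in the training step above, where the first backward (dis_loss) produces no gradients for the generator, so with find_unused_parameters=False the reducer never finishes its reduction and the next forward fails. A minimal self-contained sketch of that pattern (hypothetical toy module, assumes an initialised process group and one GPU per process):

    import torch
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    class ToyGAN(nn.Module):
        def __init__(self):
            super().__init__()
            self.gen = nn.Linear(8, 8)
            self.dis = nn.Linear(8, 1)

        def forward(self, x):
            fake = self.gen(x)
            # Detached pass: the discriminator loss has no path back to self.gen.
            return self.dis(fake.detach()).mean()

    model = DDP(ToyGAN().cuda(), find_unused_parameters=False)
    loss = model(torch.randn(4, 8).cuda())
    loss.backward()  # self.gen parameters receive no gradients here, so the
                     # reducer is left waiting; the next forward raises the
                     # "Expected to have finished reduction ..." error above.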

1 reaction
alealv commented, Aug 17, 2022

I'm sorry for the late reply. I've been very busy, and now I'm away on holiday. I'll try to find a way to reproduce it and post it.

Returning the loss from training_step didn't help. I'm using manual optimization.
