
Can't use `estimated_stepping_batches` in `configure_optimizers` with DDP

See original GitHub issue

🐛 Bug

When using DDP and calling `estimated_stepping_batches` in `configure_optimizers`, an error is thrown. It happens because `has_len_all_ranks` tries to sync the dataloader length across processes using the model's device, but at that point the model hasn't been moved to a non-CPU device yet:

https://github.com/PyTorchLightning/pytorch-lightning/blob/7ee690758ccad7f702460d056f6369c1d4371a46/pytorch_lightning/utilities/data.py#L124

The error:

  File "~/bug_report_model.py", line 36, in configure_optimizers
    self.trainer.estimated_stepping_batches
  File "~/pytorch-lightning/pytorch_lightning/trainer/trainer.py", line 2702, in estimated_stepping_batches
    self.reset_train_dataloader()
  File "~/pytorch-lightning/pytorch_lightning/trainer/trainer.py", line 1848, in reset_train_dataloader
    if has_len_all_ranks(self.train_dataloader, self.strategy, module)
  File "~/pytorch-lightning/pytorch_lightning/utilities/data.py", line 124, in has_len_all_ranks
    total_length = training_type.reduce(torch.tensor(local_length).to(model.device), reduce_op="sum")
  File "~/pytorch-lightning/pytorch_lightning/strategies/ddp_spawn.py", line 224, in reduce
    tensor = sync_ddp_if_available(tensor, group, reduce_op=reduce_op)
  File "~/pytorch-lightning/pytorch_lightning/utilities/distributed.py", line 95, in sync_ddp_if_available
    return sync_ddp(result, group=group, reduce_op=reduce_op)
  File "~/pytorch-lightning/pytorch_lightning/utilities/distributed.py", line 129, in sync_ddp
    torch.distributed.all_reduce(result, op=op, group=group, async_op=False)
  File "~/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1287, in all_reduce
    work = group.allreduce([tensor], opts)
RuntimeError: Tensors must be CUDA and dense
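
For illustration, here is a minimal sketch of the failure mode outside Lightning (it assumes a machine with two GPUs): the NCCL backend used for GPU DDP only accepts CUDA tensors, so an `all_reduce` on a CPU tensor (which is what `has_len_all_ranks` produces while the model is still on the CPU) reproduces exactly this error.

import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    # NCCL is the backend used for GPU DDP
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    # has_len_all_ranks builds this tensor on model.device; during
    # configure_optimizers the model is still on the CPU
    local_length = torch.tensor(64)  # a CPU tensor
    dist.all_reduce(local_length, op=dist.ReduceOp.SUM)
    # -> RuntimeError: Tensors must be CUDA and dense

    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)

Moving the tensor to torch.device("cuda", rank) before the reduce makes it succeed, which is essentially what the reduction in has_len_all_ranks would need to guarantee.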

To Reproduce

import torch
from torch.utils.data import DataLoader, Dataset

from pytorch_lightning import LightningModule, Trainer


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return {"loss": loss}

    def configure_optimizers(self):
        self.trainer.estimated_stepping_batches  # Can be used here to define the LR scheduler (see the workaround sketch below)
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


def run():
    train_data = DataLoader(RandomDataset(32, 64), batch_size=2)

    model = BoringModel()
    trainer = Trainer(
        num_sanity_val_steps=0,
        max_epochs=1,
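        # two GPUs; with no explicit strategy, Lightning defaults to DDP spawn here (as the traceback shows)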
        gpus=2,
        logger=False,
        enable_checkpointing=False
    )
    trainer.fit(model, train_dataloaders=train_data)


if __name__ == "__main__":
    run()
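
Until the fix lands, a possible workaround is to compute the estimate by hand inside `configure_optimizers` from values the trainer already exposes. The helper below is a hypothetical sketch, not Lightning API; it mirrors what `estimated_stepping_batches` computes in the simple case of a fixed-length dataloader, `max_epochs` set, and no `max_steps` cap.

import math


def manual_estimated_stepping_batches(trainer, train_dataloader):
    # hypothetical helper; assumes a finite dataloader, max_epochs set,
    # and no max_steps cap
    batches_per_epoch = len(train_dataloader)
    # under DDP, each process only sees its shard of the batches
    batches_per_epoch = math.ceil(batches_per_epoch / trainer.num_devices)
    # optimizer steps per epoch account for gradient accumulation
    steps_per_epoch = math.ceil(batches_per_epoch / trainer.accumulate_grad_batches)
    return steps_per_epoch * trainer.max_epochs

With the repro above (64 samples, batch size 2, 2 GPUs, 1 epoch) this gives math.ceil(32 / 2) = 16 steps, which can then be used to size an LR scheduler such as torch.optim.lr_scheduler.OneCycleLR via its total_steps argument.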

Environment

  • PyTorch Lightning Version (e.g., 1.5.0): master
  • PyTorch Version (e.g., 1.10): 1.10
  • Python version (e.g., 3.9): 3.9

cc @justusschock @kaushikb11 @awaelchli @akihironitta @rohitgr7

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 2
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

6 reactions
Yevgnen commented, May 25, 2022

Any updates?

2 reactions
rohitgr7 commented, Apr 10, 2022

The DeepSpeed strategy inherits from DDP, so this will be a problem for all DDP-related strategies. I'll update the PR to make sure it gets merged soon.

Read more comments on GitHub >
