[Horovod] ModelCheckpoint and EarlyStopping CBs hit errors with Torch 1.13+


Bug description

Since PyTorch 1.13, we have observed that the ModelCheckpoint and EarlyStopping callbacks hit an undefined symbol error with the Horovod strategy.

Details and examples are in https://github.com/horovod/horovod/commit/e392eb9daf09d2c987497686ae82b62c924c9c27

It is reproducible with Torch 1.13 alone, but I think that underneath, reduce_op from DDP should not be mixed with Horovod. This line in PTL hits the error:

https://github.com/Lightning-AI/lightning/blob/master/src/pytorch_lightning/strategies/horovod.py#L179

How to reproduce the bug

from torch.distributed import ReduceOp

op = None
op in (ReduceOp.SUM, None)

Error messages and logs

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: __eq__(): incompatible function arguments. The following argument types are supported:
    1. (self: torch._C._distributed_c10d.ReduceOp, arg0: c10d::ReduceOp::RedOpType) -> bool
    2. (self: torch._C._distributed_c10d.ReduceOp, arg0: torch._C._distributed_c10d.ReduceOp) -> bool

Invoked with: <torch.distributed.distributed_c10d.ReduceOp object at 0x7fba78c9e0b0>, None
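For anyone hitting this on an affected Torch build, here is a minimal sketch of why the membership test fails and one way to guard against it. It does not import torch: `StrictOp` is a hypothetical stand-in (not a real torch or Horovod class) that, like the binding in the traceback above, raises TypeError from `__eq__` when compared against an unsupported type. The helper names are also made up for illustration, and this is not the fix that landed upstream.

```python
class StrictOp:
    """Hypothetical stand-in for an op type whose __eq__ rejects foreign types."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        # Mimic the strict behavior seen in the traceback: comparing with
        # anything that is not the same type raises instead of returning False.
        if not isinstance(other, StrictOp):
            raise TypeError("__eq__(): incompatible function arguments")
        return self.name == other.name

    def __hash__(self):
        return hash(self.name)


SUM = StrictOp("sum")


def is_sum_or_none_unsafe(op):
    # `in` falls back to == for each tuple element, so op=None ends up
    # calling SUM.__eq__(None), which raises TypeError.
    return op in (SUM, None)


def is_sum_or_none_safe(op):
    # Check identity against None first and only compare values of the same
    # type, so the strict __eq__ never sees a foreign argument.
    return op is None or (isinstance(op, StrictOp) and op == SUM)
```

The safe variant relies on `is None` and `isinstance` short-circuiting before `==` is ever evaluated, so it should behave the same whether or not the underlying `__eq__` tolerates None.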

Environment


#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 1.10): 1.13+
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):

More info

Comments and suggestions are welcome.

cc @awaelchli

Issue Analytics

  • State: closed
  • Created 10 months ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

chongxiaoc commented, Dec 6, 2022 (1 reaction)

Closing the issue since the Torch side landed a fix. See https://github.com/pytorch/pytorch/issues/90072

awaelchli commented, Dec 6, 2022 (0 reactions)

That was fast ❤️ Great fix
