[Horovod] ModelCheckpoint and EarlyStopping CBs hit errors with Torch 1.13+
Bug description
Since PyTorch 1.13, we have observed that the ModelCheckpoint and EarlyStopping callbacks hit an undefined symbol error when used with the Horovod strategy.
Details and examples are in https://github.com/horovod/horovod/commit/e392eb9daf09d2c987497686ae82b62c924c9c27
It is reproducible with Torch 1.13 alone, but I think that, underneath, reduce_op from DDP should not be mixed with Horovod. This line in PTL hits the error.
How to reproduce the bug
from torch.distributed import ReduceOp

op = None
# On Torch 1.13 the membership test compares ReduceOp.SUM with None and raises TypeError
op in (ReduceOp.SUM, None)
Error messages and logs
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: __eq__(): incompatible function arguments. The following argument types are supported:
    1. (self: torch._C._distributed_c10d.ReduceOp, arg0: c10d::ReduceOp::RedOpType) -> bool
    2. (self: torch._C._distributed_c10d.ReduceOp, arg0: torch._C._distributed_c10d.ReduceOp) -> bool

Invoked with: <torch.distributed.distributed_c10d.ReduceOp object at 0x7fba78c9e0b0>, None
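The traceback suggests that the membership test ends up calling `ReduceOp.__eq__` with `None`, which the binding rejects with a `TypeError` instead of returning `NotImplemented`. The sketch below reproduces that pattern without torch and shows one possible caller-side guard. `StrictOp`, `is_sum_or_none`, and `membership_raises` are hypothetical names made up for illustration. They are not part of torch or Lightning, and this is not necessarily how either project handled the problem.

```python
class StrictOp:
    """Hypothetical stand-in (not part of torch) for a bound type whose
    __eq__ raises on foreign argument types instead of returning NotImplemented."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        if not isinstance(other, StrictOp):
            raise TypeError("__eq__(): incompatible function arguments")
        return self.name == other.name

    def __hash__(self):
        return hash(self.name)


SUM = StrictOp("SUM")


def is_sum_or_none(op):
    # Check identity with None first, so __eq__ is never invoked with None.
    return op is None or op == SUM


def membership_raises(op):
    # Mirrors the failing pattern from the report: `op in (SUM, None)`.
    try:
        op in (SUM, None)
    except TypeError:
        return True
    return False
```

With this stand-in, `membership_raises(None)` reports the failure because the tuple's first element is compared with `None` before the second element is reached, while `is_sum_or_none(None)` short-circuits on the identity check and never calls `__eq__`.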
Environment
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 1.10): 1.13+
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):
More info
Comments and suggestions are welcome.
cc @awaelchli
Issue Analytics
- State:
- Created 10 months ago
- Comments: 6 (3 by maintainers)
Top GitHub Comments
Closing this issue since a fix has landed on the Torch side. See https://github.com/pytorch/pytorch/issues/90072
That was fast ❤️ Great fix