Early stopping fails on horovod with cannot unpack non-iterable NoneType object
## 🐛 Bug
When I do early stopping with Horovod distributed training, it fails with `cannot unpack non-iterable NoneType object` in tqdm.
It fails only on some sets of training data. I also see from the logs that early stopping was initiated only three times, while I'm training on 4 workers. This makes me suspect that one of the workers never initiated early stopping, presumably because each worker decides based on its local validation loss rather than the loss averaged across workers. Here is the relevant part of my validation step:
```python
result = pl.EvalResult(early_stop_on=loss, checkpoint_on=loss)
result.log("val_loss", loss, sync_dist=True)
```
As you can see, I'm asking PyTorch Lightning to average the validation loss, but as was the case in my previous issue #3338, the problem seems to be that early stopping uses another dict. Here's the full error message:
```
Epoch 7: 0% 0/8 [00:00<?, ?it/s, loss=0.480, v_num=50]Traceback (most recent call last):
File "main.py", line 72, in <module>
main()
File "main.py", line 68, in main
trainer.fit(model, data_module)
File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/states.py", line 48, in wrapped_fn
result = fn(self, *args, **kwargs)
File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1016, in fit
results = self.accelerator_backend.train()
File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/accelerators/horovod_backend.py", line 108, in train
result = self.trainer.run_pretrain_routine(self.trainer.model)
File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1237, in run_pretrain_routine
self.train()
File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 396, in train
self.run_training_epoch()
File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 484, in run_training_epoch
batch_output = self.run_training_batch(batch, batch_idx)
File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 832, in run_training_batch
opt_closure_result = self.optimizer_closure(
File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 1065, in optimizer_closure
model_ref.backward(self, closure_loss, optimizer, opt_idx)
File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/core/hooks.py", line 312, in backward
loss.backward()
File "/home/aca10027xu/.local/lib/python3.8/site-packages/torch/tensor.py", line 185, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/aca10027xu/.local/lib/python3.8/site-packages/torch/autograd/__init__.py", line 125, in backward
Variable._execution_engine.run_backward(
RuntimeError: Horovod has been shut down. This was caused by an exception on one of the ranks or an attempt to allreduce, allgather or broadcast a tensor after one of the ranks finished execution. If the shutdown was caused by an exception, you should see the exception in the log before the first shutdown message.
Exception ignored in: <function tqdm.__del__ at 0x2b61156e6820>
Traceback (most recent call last):
File "/home/aca10027xu/.local/lib/python3.8/site-packages/tqdm/std.py", line 1086, in __del__
File "/home/aca10027xu/.local/lib/python3.8/site-packages/tqdm/std.py", line 1293, in close
File "/home/aca10027xu/.local/lib/python3.8/site-packages/tqdm/std.py", line 1471, in display
File "/home/aca10027xu/.local/lib/python3.8/site-packages/tqdm/std.py", line 1089, in __repr__
File "/home/aca10027xu/.local/lib/python3.8/site-packages/tqdm/std.py", line 1433, in format_dict
TypeError: cannot unpack non-iterable NoneType object
```
Without early stopping, it works fine.
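If the per-rank divergence theory is right, one possible workaround (a minimal, untested sketch; the loss computation here is a placeholder for my actual validation step) would be to allreduce the loss manually so every rank sees the same early-stopping value:

```python
# Untested workaround sketch: average the validation loss across ranks before
# early stopping sees it, so all workers make the same stop/continue decision.
import torch.nn.functional as F
import horovod.torch as hvd
import pytorch_lightning as pl

class MyModel(pl.LightningModule):
    # ... __init__, forward, training_step, configure_optimizers unchanged ...

    def validation_step(self, batch, batch_idx):
        x, y = batch
        loss = F.mse_loss(self(x), y)  # placeholder loss computation
        # Lightning's Horovod backend has already called hvd.init(), and
        # hvd.allreduce averages the tensor across ranks by default.
        avg_loss = hvd.allreduce(loss)
        result = pl.EvalResult(early_stop_on=avg_loss, checkpoint_on=avg_loss)
        result.log("val_loss", avg_loss)
        return result
```

Note that hvd.allreduce is a collective op, so it has to run on every rank in every validation step, otherwise the remaining ranks hang waiting for it.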
### Environment
- CUDA:
  - GPU:
    - Tesla V100-SXM2-16GB
    - Tesla V100-SXM2-16GB
    - Tesla V100-SXM2-16GB
    - Tesla V100-SXM2-16GB
  - available: True
  - version: 10.2
- Packages:
  - numpy: 1.19.1
  - pyTorch_debug: False
  - pyTorch_version: 1.6.0
  - pytorch-lightning: 0.9.1rc1
  - tensorboard: 2.2.0
  - tqdm: 4.46.1
- System:
  - OS: Linux
  - architecture:
    - 64bit
    - ELF
  - processor: x86_64
  - python: 3.8.2
  - version: #1 SMP Fri Apr 20 16:44:24 UTC 2018
I also had this on Lightning 0.9.0; I actually upgraded to the RC hoping it would magically fix the problem.
Hey @undertherain, sorry for the late response. I tried looking into this earlier but couldn't repro. I suspect there may be a few things going on here.
@undertherain, if you can provide a minimal repro, that will help a lot. I will also try to prioritize getting metrics aggregation for Horovod working in PL, which may address this issue as a side effect.
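For reference, something along these lines would be ideal (a sketch of the shape of repro that helps, assuming PL 0.9.x with the Horovod backend; the model, data, and hyperparameters are placeholders, not taken from the report above):

```python
# Hypothetical minimal repro sketch (untested): a toy LightningModule trained
# under Horovod with early stopping enabled.
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

class ToyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return pl.TrainResult(F.mse_loss(self(x), y))

    def validation_step(self, batch, batch_idx):
        x, y = batch
        loss = F.mse_loss(self(x), y)
        # Same logging pattern as in the report above.
        result = pl.EvalResult(early_stop_on=loss, checkpoint_on=loss)
        result.log("val_loss", loss, sync_dist=True)
        return result

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)

def make_loader():
    # Unseeded random data, so each rank sees different batches and the
    # per-rank validation losses can diverge.
    x, y = torch.randn(256, 32), torch.randn(256, 1)
    return DataLoader(TensorDataset(x, y), batch_size=32)

if __name__ == "__main__":
    trainer = pl.Trainer(
        gpus=1,
        distributed_backend="horovod",
        max_epochs=50,
        early_stop_callback=True,  # stop on the early_stop_on value
    )
    trainer.fit(ToyModel(),
                train_dataloader=make_loader(),
                val_dataloaders=make_loader())
    # Launch with: horovodrun -np 4 python repro.py
```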
Yes it seems to work! Thanks!