
Early stopping fails on horovod with cannot unpack non-iterable NoneType object

See original GitHub issue

🐛 Bug

When I do early stopping with Horovod distributed training, it fails with cannot unpack non-iterable NoneType object in tqdm. It fails only on some sets of training data. Also, I see from the logs that early stopping was initiated only three times, while I’m training on 4 workers. This makes me suspect that one of the workers did not initiate early stopping - presumably because each worker decides based not on the averaged, but on its local validation loss.

        result = pl.EvalResult(early_stop_on=loss, checkpoint_on=loss)
        result.log("val_loss", loss, sync_dist=True)

As you can see, I’m asking pytorch-lightning to average the validation loss, but as was the case in my previous issue #3338, the problem seems to be related to early stopping using another dict. Here’s the full error message:

Epoch 7:   0% 0/8 [00:00<?, ?it/s, loss=0.480, v_num=50]Traceback (most recent call last):
  File "main.py", line 72, in <module>       
    main()
  File "main.py", line 68, in main
    trainer.fit(model, data_module)
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/states.py", line 48, in wrapped_fn
    result = fn(self, *args, **kwargs)
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1016, in fit
    results = self.accelerator_backend.train()
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/accelerators/horovod_backend.py", line 108, in train
    result = self.trainer.run_pretrain_routine(self.trainer.model)
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1237, in run_pretrain_routine
    self.train()
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 396, in train
    self.run_training_epoch()
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 484, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx)
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 832, in run_training_batch
    opt_closure_result = self.optimizer_closure(
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 1065, in optimizer_closure
    model_ref.backward(self, closure_loss, optimizer, opt_idx)
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/core/hooks.py", line 312, in backward
    loss.backward()
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/torch/tensor.py", line 185, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/torch/autograd/__init__.py", line 125, in backward
    Variable._execution_engine.run_backward(
RuntimeError: Horovod has been shut down. This was caused by an exception on one of the ranks or an attempt to allreduce, allgather or broadcast a tensor after one of the ranks finished execution. If the shutdown was caused by an exception, you should see the exception in the log before the first shutdown message.
Exception ignored in: <function tqdm.__del__ at 0x2b61156e6820>
Traceback (most recent call last):
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/tqdm/std.py", line 1086, in __del__
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/tqdm/std.py", line 1293, in close
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/tqdm/std.py", line 1471, in display
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/tqdm/std.py", line 1089, in __repr__
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/tqdm/std.py", line 1433, in format_dict
TypeError: cannot unpack non-iterable NoneType object
Without early stopping it works OK.
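To illustrate the suspected failure mode, here is a toy, pure-Python simulation (made-up loss numbers, not the actual Lightning/Horovod code): when each rank applies a patience-based criterion to its own validation loss, one rank can decide to stop while the others keep training, which is exactly the situation where Horovod's collective ops then shut down; averaging the loss first makes the decision unanimous.

```python
def should_stop(history, patience=2):
    """Stop once the loss has failed to improve for `patience` epochs."""
    best = min(history)
    epochs_since_best = len(history) - 1 - history.index(best)
    return epochs_since_best >= patience

# Hypothetical per-rank validation losses over 5 epochs;
# rank 3 happens to see noisier data and hits a local minimum early.
losses = {
    0: [1.0, 0.8, 0.7, 0.6, 0.5],
    1: [1.0, 0.9, 0.7, 0.6, 0.5],
    2: [1.1, 0.8, 0.7, 0.6, 0.5],
    3: [1.0, 0.6, 0.7, 0.8, 0.9],  # best loss at epoch 1, then worse
}

# Local decision: each rank applies the criterion to its own loss.
local = {rank: should_stop(h) for rank, h in losses.items()}
print(local)  # rank 3 stops, ranks 0-2 keep training -> collective-op mismatch

# Averaged decision: compute the cross-rank mean first (in real code this
# would be an allreduce), so every rank applies the criterion to the same number.
avg_history = [sum(h[e] for h in losses.values()) / len(losses)
               for e in range(5)]
unanimous = should_stop(avg_history)
print(unanimous)  # False: the averaged loss is still improving, all ranks continue
```

With local decisions, rank 3 finishes its loop while the other three ranks issue another allreduce during backward, which matches the "one of the ranks finished execution" message in the traceback above.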

### Environment
  • CUDA:
    • GPU:
      • Tesla V100-SXM2-16GB
      • Tesla V100-SXM2-16GB
      • Tesla V100-SXM2-16GB
      • Tesla V100-SXM2-16GB
    • available: True
    • version: 10.2
  • Packages:
    • numpy: 1.19.1
    • pyTorch_debug: False
    • pyTorch_version: 1.6.0
    • pytorch-lightning: 0.9.1rc1
    • tensorboard: 2.2.0
    • tqdm: 4.46.1
  • System:
    • OS: Linux
    • architecture:
      • 64bit
      • ELF
    • processor: x86_64
    • python: 3.8.2
    • version: #1 SMP Fri Apr 20 16:44:24 UTC 2018
I also had this on Lightning 0.9.0 - I actually upgraded to the RC hoping that it would magically fix the problem.

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 11 (8 by maintainers)

Top GitHub Comments

1 reaction
tgaddair commented, Sep 21, 2020

Hey @undertherain, sorry for the late response. I tried looking into this earlier but couldn’t repro. I suspect there may be a few things going on here:

  1. Early stopping criteria being applied independently on each worker. We should implement PL metrics aggregation for Horovod so the criteria can be applied consistently on every worker.
  2. Something appears to be wrong with the params in the DistributedOptimizer. This sounds related, but it’s not clear to me how without being able to repro it myself (see https://discuss.pytorch.org/t/how-to-do-only-one-forward-propagation-per-epoch-and-multiple-backward-propagations-on-graph/65396/9 for more context).

@undertherain, if you can provide a minimal repro, that will help a lot. I will also try to prioritize getting metrics aggregation for Horovod working in PL, which may also address this issue as a side effect.
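A hedged sketch of what point 1 above could look like: average the monitored metric across ranks before the early-stopping criterion sees it. In real code the cross-rank mean would come from `hvd.allreduce` in `horovod.torch`; `fake_allreduce` below is a hypothetical stand-in so the sketch is self-contained and runnable anywhere.

```python
def fake_allreduce(values):
    """Stand-in for hvd.allreduce(loss, average=True): cross-rank mean."""
    return sum(values) / len(values)

class AveragedEarlyStopping:
    """Patience-based early stopping applied to the cross-rank mean loss."""
    def __init__(self, patience=2):
        self.patience = patience
        self.best = float("inf")
        self.wait = 0

    def step(self, per_rank_losses):
        # Every rank sees the same averaged value, so the decision below
        # is identical on all ranks and no rank exits the loop alone.
        loss = fake_allreduce(per_rank_losses)
        if loss < self.best:
            self.best, self.wait = loss, 0
        else:
            self.wait += 1
        return self.wait >= self.patience

# Two ranks, four epochs of hypothetical validation losses.
stopper = AveragedEarlyStopping(patience=2)
for epoch_losses in [[1.0, 1.1], [0.8, 0.9], [0.85, 0.95], [0.9, 1.0]]:
    stop = stopper.step(epoch_losses)
print(stop)  # True: the averaged loss has not improved for 2 epochs
```

Because the criterion only ever consumes the allreduced value, all workers reach the same stop/continue verdict at the same epoch, avoiding the mismatched-collective shutdown from the original traceback.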

0 reactions
undertherain commented, Oct 5, 2020

Yes it seems to work! Thanks!


