
Early stopping fails on horovod with cannot unpack non-iterable NoneType object

See original GitHub issue

🐛 Bug

When I do early stopping with Horovod distributed training, it fails with cannot unpack non-iterable NoneType object in tqdm. It fails only on some sets of training data. Also, I see from the logs that early stopping was initiated only three times, while I’m training on 4 workers. This makes me suspect that one of the workers did not initiate early stopping - presumably because each worker decides based not on the averaged, but on its local validation loss.

        result = pl.EvalResult(early_stop_on=loss, checkpoint_on=loss)
        result.log("val_loss", loss, sync_dist=True)

As you can see, I’m asking pytorch-lightning to average the validation loss, but as was the case in my previous issue #3338, the problem seems to be related to early stopping using another dict. Here’s the full error message:

Epoch 7:   0% 0/8 [00:00<?, ?it/s, loss=0.480, v_num=50]Traceback (most recent call last):
  File "main.py", line 72, in <module>       
    main()
  File "main.py", line 68, in main
    trainer.fit(model, data_module)
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/states.py", line 48, in wrapped_fn
    result = fn(self, *args, **kwargs)
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1016, in fit
    results = self.accelerator_backend.train()
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/accelerators/horovod_backend.py", line 108, in train
    result = self.trainer.run_pretrain_routine(self.trainer.model)
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1237, in run_pretrain_routine
    self.train()
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 396, in train
    self.run_training_epoch()
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 484, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx)
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 832, in run_training_batch
    opt_closure_result = self.optimizer_closure(
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 1065, in optimizer_closure
    model_ref.backward(self, closure_loss, optimizer, opt_idx)
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/pytorch_lightning/core/hooks.py", line 312, in backward
    loss.backward()
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/torch/tensor.py", line 185, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/torch/autograd/__init__.py", line 125, in backward
    Variable._execution_engine.run_backward(
RuntimeError: Horovod has been shut down. This was caused by an exception on one of the ranks or an attempt to allreduce, allgather or broadcast a tensor after one of the ranks finished execution. If the shutdown was caused by an exception, you should see the exception in the log before the first shutdown message.
Exception ignored in: <function tqdm.__del__ at 0x2b61156e6820>
Traceback (most recent call last):
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/tqdm/std.py", line 1086, in __del__
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/tqdm/std.py", line 1293, in close
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/tqdm/std.py", line 1471, in display
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/tqdm/std.py", line 1089, in __repr__
  File "/home/aca10027xu/.local/lib/python3.8/site-packages/tqdm/std.py", line 1433, in format_dict
TypeError: cannot unpack non-iterable NoneType object
Without early stopping it works OK.
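To illustrate the suspected failure mode, here is a toy, pure-Python simulation (made-up loss numbers, not the actual Lightning/Horovod code): when each rank applies a patience-based criterion to its own validation loss, one rank can decide to stop while the others keep training, which is exactly the situation where Horovod's collective ops then shut down; averaging the loss first makes the decision unanimous.

```python
def should_stop(history, patience=2):
    """Stop once the loss has failed to improve for `patience` epochs."""
    best = min(history)
    epochs_since_best = len(history) - 1 - history.index(best)
    return epochs_since_best >= patience

# Hypothetical per-rank validation losses over 5 epochs;
# rank 3 happens to see noisier data and hits a local minimum early.
losses = {
    0: [1.0, 0.8, 0.7, 0.6, 0.5],
    1: [1.0, 0.9, 0.7, 0.6, 0.5],
    2: [1.1, 0.8, 0.7, 0.6, 0.5],
    3: [1.0, 0.6, 0.7, 0.8, 0.9],  # best loss at epoch 1, then worse
}

# Local decision: each rank applies the criterion to its own loss.
local = {rank: should_stop(h) for rank, h in losses.items()}
print(local)  # rank 3 stops, ranks 0-2 keep training -> collective-op mismatch

# Averaged decision: compute the cross-rank mean first (in real code this
# would be an allreduce), so every rank applies the criterion to the same number.
avg_history = [sum(h[e] for h in losses.values()) / len(losses)
               for e in range(5)]
unanimous = should_stop(avg_history)
print(unanimous)  # False: the averaged loss is still improving, all ranks continue
```

With local decisions, rank 3 finishes its loop while the other three ranks issue another allreduce during backward, which matches the "one of the ranks finished execution" message in the traceback above.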

### Environment
  • CUDA:
    • GPU:
      • Tesla V100-SXM2-16GB
      • Tesla V100-SXM2-16GB
      • Tesla V100-SXM2-16GB
      • Tesla V100-SXM2-16GB
    • available: True
    • version: 10.2
  • Packages:
    • numpy: 1.19.1
    • pyTorch_debug: False
    • pyTorch_version: 1.6.0
    • pytorch-lightning: 0.9.1rc1
    • tensorboard: 2.2.0
    • tqdm: 4.46.1
  • System:
    • OS: Linux
    • architecture:
      • 64bit
      • ELF
    • processor: x86_64
    • python: 3.8.2
    • version: #1 SMP Fri Apr 20 16:44:24 UTC 2018
I also had this on Lightning 0.9.0 - I actually upgraded to the RC hoping that it would magically fix the problem.

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 11 (8 by maintainers)

Top GitHub Comments

1 reaction
tgaddair commented, Sep 21, 2020

Hey @undertherain, sorry for the late response. I tried looking into this earlier but couldn’t repro. I suspect there may be a few things going on here:

  1. Early stopping criteria being applied independently on each worker. We should implement PL metrics aggregation for Horovod so the criteria can be applied consistently on every worker.
  2. Something appears to be wrong with the params in the DistributedOptimizer. This sounds related, but it’s not clear to me how without being able to repro it myself (see https://discuss.pytorch.org/t/how-to-do-only-one-forward-propagation-per-epoch-and-multiple-backward-propagations-on-graph/65396/9 for more context).

@undertherain, if you can provide a minimal repro, that will help a lot. I will also try to prioritize getting metrics aggregation for Horovod working in PL, which may also address this issue as a side effect.
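A hedged sketch of what point 1 above could look like: average the monitored metric across ranks before the early-stopping criterion sees it. In real code the cross-rank mean would come from `hvd.allreduce` in `horovod.torch`; `fake_allreduce` below is a hypothetical stand-in so the sketch is self-contained and runnable anywhere.

```python
def fake_allreduce(values):
    """Stand-in for hvd.allreduce(loss, average=True): cross-rank mean."""
    return sum(values) / len(values)

class AveragedEarlyStopping:
    """Patience-based early stopping applied to the cross-rank mean loss."""
    def __init__(self, patience=2):
        self.patience = patience
        self.best = float("inf")
        self.wait = 0

    def step(self, per_rank_losses):
        # Every rank sees the same averaged value, so the decision below
        # is identical on all ranks and no rank exits the loop alone.
        loss = fake_allreduce(per_rank_losses)
        if loss < self.best:
            self.best, self.wait = loss, 0
        else:
            self.wait += 1
        return self.wait >= self.patience

# Two ranks, four epochs of hypothetical validation losses.
stopper = AveragedEarlyStopping(patience=2)
for epoch_losses in [[1.0, 1.1], [0.8, 0.9], [0.85, 0.95], [0.9, 1.0]]:
    stop = stopper.step(epoch_losses)
print(stop)  # True: the averaged loss has not improved for 2 epochs
```

Because the criterion only ever consumes the allreduced value, all workers reach the same stop/continue verdict at the same epoch, avoiding the mismatched-collective shutdown from the original traceback.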

0 reactions
undertherain commented, Oct 5, 2020

Yes it seems to work! Thanks!


