
Hangs when finishing training with the medium model

See original GitHub issue

Describe the bug
Training hangs when finishing a run with the default medium.yaml. No issue was observed with small.yaml.

Screenshots

-----------------------------------------------------------------------------------------------------------
 validation results at iteration 320000 | lm_loss value: 2.567550E+00 | lm_loss_ppl value: 1.303385E+01 |
-----------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------
 validation results at the end of training for val data | lm_loss value: 2.536536E+00 | lm_loss_ppl value: 1.263582E+01 |
---------------------------------------------------------------------------------------------------------------------------
[2021-11-30 11:32:23,930] [INFO] [logging.py:60:log_dist] [Rank 0] Saving model checkpoint: ../mp_rank_00_model_states.pt
[2021-11-30 11:32:43,555] [INFO] [engine.py:1805:_copy_recovery_script] creating recovery script ../zero_to_fp32.py
[2021-11-30 11:32:43,567] [INFO] [engine.py:1818:_save_zero_checkpoint] zero checkpoint saved ../zero_pp_rank_0_mp_rank_00_optim_states.pt
[2021-11-30 11:32:43,668] [INFO] [engine.py:1805:_copy_recovery_script] creating recovery script ../zero_to_fp32.py
[2021-11-30 11:32:43,676] [INFO] [engine.py:1818:_save_zero_checkpoint] zero checkpoint saved ../zero_pp_rank_3_mp_rank_00_optim_states.pt
[2021-11-30 11:32:43,807] [INFO] [engine.py:1805:_copy_recovery_script] creating recovery script ../zero_to_fp32.py
[2021-11-30 11:32:43,821] [INFO] [engine.py:1818:_save_zero_checkpoint] zero checkpoint saved ../zero_pp_rank_2_mp_rank_00_optim_states.pt
[2021-11-30 11:32:43,840] [INFO] [engine.py:1805:_copy_recovery_script] creating recovery script ../zero_to_fp32.py
[2021-11-30 11:32:43,848] [INFO] [engine.py:1818:_save_zero_checkpoint] zero checkpoint saved ../zero_pp_rank_1_mp_rank_00_optim_states.pt
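The log above shows each ZeRO data-parallel rank writing its own optimizer-state shard and a zero_to_fp32.py recovery script. DeepSpeed's checkpoint save is a collective operation, so if any rank fails to reach it, the others block. As a toy illustration of that failure mode (plain Python threads standing in for ranks; this is an analogy, not DeepSpeed code):

```python
import threading

def simulate_checkpoint(num_ranks, ranks_that_save):
    """Toy model of a collective checkpoint save: every rank must
    participate, or the remaining ranks block at the barrier.
    Returns 'ok' when all ranks arrive, 'hang' when any rank skips it."""
    barrier = threading.Barrier(num_ranks)
    timed_out = []

    def rank(i):
        if i not in ranks_that_save:
            return  # this rank never calls the collective save
        try:
            # short timeout so the demo finishes instead of hanging forever
            barrier.wait(timeout=0.5)
        except threading.BrokenBarrierError:
            timed_out.append(i)

    threads = [threading.Thread(target=rank, args=(i,)) for i in range(num_ranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return "hang" if timed_out else "ok"

print(simulate_checkpoint(4, {0, 1, 2, 3}))  # ok: all ranks reach the save
print(simulate_checkpoint(4, {0}))           # hang: rank 0 waits on ranks that never arrive
```

In real training the blocked ranks have no timeout, which is why the run appears to freeze right after the last checkpoint lines are printed.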

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

1 reaction
sameeravithana commented, Jan 7, 2022

This was resolved by an update to DeepSpeed.
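The thread does not name the fixed DeepSpeed version, so no pin is given here, but a small stdlib check lets you confirm what is installed before and after running `pip install --upgrade deepspeed`:

```python
from importlib import metadata

def installed_version(pkg: str):
    """Return the installed version string of pkg, or None if it is not installed."""
    try:
        return metadata.version(pkg)
    except metadata.PackageNotFoundError:
        return None

# After upgrading, verify which DeepSpeed release actually got installed.
print(installed_version("deepspeed"))
```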

0 reactions
sdtblck commented, Dec 17, 2021

Huh, I’ve personally only ever experienced training hanging on multiple nodes.

Actually, Signals.SIGSEGV: 11 suggests a segfault. Is this something you can reproduce repeatedly with the medium.yml configuration at 100 steps?
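One low-cost way to try the 100-step reproduction sdtblck suggests is to shorten the run in the config. The key names below follow GPT-NeoX-style configs and are an assumption; check them against your own medium.yaml before use:

```yaml
# Hypothetical excerpt of medium.yaml, trimmed for a quick repro run.
# Key names assumed from GPT-NeoX-style configs; verify against your file.
train-iters: 100
lr-decay-iters: 100
save-interval: 100   # force a checkpoint save at the end, where the hang occurs
```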
