Hangs when finishing up a medium model training
See original GitHub issue

Describe the bug
Training hangs when finishing up a run with the default medium.yaml. The same issue does not occur with small.yaml.
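Not from the original report, but a minimal diagnostic sketch for this kind of end-of-run hang: registering faulthandler lets you dump every rank's Python stack on demand (via SIGUSR1) or after a timeout, so you can see which call the stuck ranks are blocked in. The signal choice and the 600-second timeout are assumptions, not anything the reporter used.

```python
# Hypothetical diagnostic, not part of the original training script:
# dump the stack of every thread when the process receives SIGUSR1,
# so a hung rank can be inspected with `kill -USR1 <pid>`.
import faulthandler
import signal
import sys

faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)

# Also dump automatically if the process is still running 600 seconds
# after this point (e.g. placed just before the final checkpoint save).
faulthandler.dump_traceback_later(timeout=600, exit=False, file=sys.stderr)
```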
Screenshots
-----------------------------------------------------------------------------------------------------------
validation results at iteration 320000 | lm_loss value: 2.567550E+00 | lm_loss_ppl value: 1.303385E+01 |
-----------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------
validation results at the end of training for val data | lm_loss value: 2.536536E+00 | lm_loss_ppl value: 1.263582E+01 |
---------------------------------------------------------------------------------------------------------------------------
[2021-11-30 11:32:23,930] [INFO] [logging.py:60:log_dist] [Rank 0] Saving model checkpoint: ../mp_rank_00_model_states.pt
[2021-11-30 11:32:43,555] [INFO] [engine.py:1805:_copy_recovery_script] creating recovery script ../zero_to_fp32.py
[2021-11-30 11:32:43,567] [INFO] [engine.py:1818:_save_zero_checkpoint] zero checkpoint saved ../zero_pp_rank_0_mp_rank_00_optim_states.pt
[2021-11-30 11:32:43,668] [INFO] [engine.py:1805:_copy_recovery_script] creating recovery script ../zero_to_fp32.py
[2021-11-30 11:32:43,676] [INFO] [engine.py:1818:_save_zero_checkpoint] zero checkpoint saved ../zero_pp_rank_3_mp_rank_00_optim_states.pt
[2021-11-30 11:32:43,807] [INFO] [engine.py:1805:_copy_recovery_script] creating recovery script ../zero_to_fp32.py
[2021-11-30 11:32:43,821] [INFO] [engine.py:1818:_save_zero_checkpoint] zero checkpoint saved ../zero_pp_rank_2_mp_rank_00_optim_states.pt
[2021-11-30 11:32:43,840] [INFO] [engine.py:1805:_copy_recovery_script] creating recovery script ../zero_to_fp32.py
[2021-11-30 11:32:43,848] [INFO] [engine.py:1818:_save_zero_checkpoint] zero checkpoint saved ../zero_pp_rank_1_mp_rank_00_optim_states.pt
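For context on the log above: these lines come from DeepSpeed's checkpointing code. A rough sketch of the end-of-training sequence (not the project's actual training loop; engine, save_dir, and tag are placeholders) is shown below. save_checkpoint is a collective call, so a single rank crashing or stalling inside it leaves the remaining ranks waiting indefinitely, which looks like a hang right after the "zero checkpoint saved" messages.

```python
# Rough sketch of the end-of-training save that produces log lines like the
# ones above. `engine` is the object returned by deepspeed.initialize();
# `save_dir` and `tag` are placeholders.
import torch.distributed as dist

def finalize_training(engine, save_dir, tag):
    # Collective call: every data-parallel rank writes its own
    # zero_pp_rank_*_mp_rank_00_optim_states.pt shard, and rank 0 also writes
    # mp_rank_00_model_states.pt plus the zero_to_fp32.py recovery script.
    engine.save_checkpoint(save_dir, tag)

    # If one rank crashes (e.g. with SIGSEGV) before reaching this point,
    # the surviving ranks block here forever.
    if dist.is_initialized():
        dist.barrier()
```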
Issue Analytics
- Created 2 years ago
- Comments: 5 (2 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
This was resolved with an update to DeepSpeed.
Huh, I’ve personally only ever experienced training hanging on multiple nodes.
Actually, Signals.SIGSEGV: 11 suggests a segfault. Is this something you can reproduce repeatedly with the medium.yml configuration at 100 steps?
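A generic way to confirm the segfault hypothesis while reproducing the short run (an assumption-laden sketch, not something posted in the thread; train.py and the rank PID are placeholders): enable faulthandler so the crashing rank prints a traceback for the fatal signal, and inspect any still-hung ranks with py-spy.

```python
# Enable Python's built-in fatal-signal handler: on SIGSEGV, SIGBUS, SIGABRT,
# SIGFPE or SIGILL the interpreter prints a traceback before dying, which
# identifies the crashing rank and call site.
import faulthandler
faulthandler.enable(all_threads=True)

# Equivalent without code changes: launch with
#   PYTHONFAULTHANDLER=1 python train.py ...
# and inspect an already-hung rank from another shell with
#   py-spy dump --pid <rank_pid>
```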