Hangs when finishing up a medium model training
See original GitHub issue

Describe the bug
Training hangs when finishing up a run with the default medium.yaml. The same issue does not occur with small.yaml.
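Not from the original report, but a minimal diagnostic sketch for this kind of end-of-run hang: registering faulthandler lets you dump every rank's Python stack on demand (via SIGUSR1) or after a timeout, so you can see which call the stuck ranks are blocked in. The signal choice and the 600-second timeout are assumptions, not anything the reporter used.

```python
# Hypothetical diagnostic, not part of the original training script:
# dump the stack of every thread when the process receives SIGUSR1,
# so a hung rank can be inspected with `kill -USR1 <pid>`.
import faulthandler
import signal
import sys

faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)

# Also dump automatically if the process is still running 600 seconds
# after this point (e.g. placed just before the final checkpoint save).
faulthandler.dump_traceback_later(timeout=600, exit=False, file=sys.stderr)
```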
Screenshots
-----------------------------------------------------------------------------------------------------------
validation results at iteration 320000 | lm_loss value: 2.567550E+00 | lm_loss_ppl value: 1.303385E+01 |
-----------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------
validation results at the end of training for val data | lm_loss value: 2.536536E+00 | lm_loss_ppl value: 1.263582E+01 |
---------------------------------------------------------------------------------------------------------------------------
[2021-11-30 11:32:23,930] [INFO] [logging.py:60:log_dist] [Rank 0] Saving model checkpoint: ../mp_rank_00_model_states.pt
[2021-11-30 11:32:43,555] [INFO] [engine.py:1805:_copy_recovery_script] creating recovery script ../zero_to_fp32.py
[2021-11-30 11:32:43,567] [INFO] [engine.py:1818:_save_zero_checkpoint] zero checkpoint saved ../zero_pp_rank_0_mp_rank_00_optim_states.pt
[2021-11-30 11:32:43,668] [INFO] [engine.py:1805:_copy_recovery_script] creating recovery script ../zero_to_fp32.py
[2021-11-30 11:32:43,676] [INFO] [engine.py:1818:_save_zero_checkpoint] zero checkpoint saved ../zero_pp_rank_3_mp_rank_00_optim_states.pt
[2021-11-30 11:32:43,807] [INFO] [engine.py:1805:_copy_recovery_script] creating recovery script ../zero_to_fp32.py
[2021-11-30 11:32:43,821] [INFO] [engine.py:1818:_save_zero_checkpoint] zero checkpoint saved ../zero_pp_rank_2_mp_rank_00_optim_states.pt
[2021-11-30 11:32:43,840] [INFO] [engine.py:1805:_copy_recovery_script] creating recovery script ../zero_to_fp32.py
[2021-11-30 11:32:43,848] [INFO] [engine.py:1818:_save_zero_checkpoint] zero checkpoint saved ../zero_pp_rank_1_mp_rank_00_optim_states.pt
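For context on the log above: these lines come from DeepSpeed's checkpointing code. A rough sketch of the end-of-training sequence (not the project's actual training loop; engine, save_dir, and tag are placeholders) is shown below. save_checkpoint is a collective call, so a single rank crashing or stalling inside it leaves the remaining ranks waiting indefinitely, which looks like a hang right after the "zero checkpoint saved" messages.

```python
# Rough sketch of the end-of-training save that produces log lines like the
# ones above. `engine` is the object returned by deepspeed.initialize();
# `save_dir` and `tag` are placeholders.
import torch.distributed as dist

def finalize_training(engine, save_dir, tag):
    # Collective call: every data-parallel rank writes its own
    # zero_pp_rank_*_mp_rank_00_optim_states.pt shard, and rank 0 also writes
    # mp_rank_00_model_states.pt plus the zero_to_fp32.py recovery script.
    engine.save_checkpoint(save_dir, tag)

    # If one rank crashes (e.g. with SIGSEGV) before reaching this point,
    # the surviving ranks block here forever.
    if dist.is_initialized():
        dist.barrier()
```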
Issue Analytics
- Created 2 years ago
- Comments: 5 (2 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
This was resolved with an update to DeepSpeed.
Huh, I’ve personally only ever experienced training hanging on multiple nodes.
Actually, Signals.SIGSEGV: 11 suggests a segfault. Is this something you can reproduce repeatedly with the medium.yml configuration at 100 steps?
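A generic way to confirm the segfault hypothesis while reproducing the short run (an assumption-laden sketch, not something posted in the thread; train.py and the rank PID are placeholders): enable faulthandler so the crashing rank prints a traceback for the fatal signal, and inspect any still-hung ranks with py-spy.

```python
# Enable Python's built-in fatal-signal handler: on SIGSEGV, SIGBUS, SIGABRT,
# SIGFPE or SIGILL the interpreter prints a traceback before dying, which
# identifies the crashing rank and call site.
import faulthandler
faulthandler.enable(all_threads=True)

# Equivalent without code changes: launch with
#   PYTHONFAULTHANDLER=1 python train.py ...
# and inspect an already-hung rank from another shell with
#   py-spy dump --pid <rank_pid>
```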