Off by something after restart?
Here’s a log of num_updates and ppl in one of the recent runs:
"45999 9.55"
"46000 9.65"
"46001 9.64"
"46002 9.41"
"46003 9.7"
"46004 9.4"
"46005 9.56"
"46006 9.34"
"46007 9.11"
"46008 9.44"
"46009 9.56"
"46010 9.24"
Then on restart reloading from 46000, I see:
"46310 9.46"
"46311 9.5"
(restart happens)
"46001 9.44"
"46002 9.64"
"46003 9.41"
"46004 9.7"
"46005 9.4"
"46006 9.56"
"46007 9.34"
"46008 9.11"
"46009 9.31"
"46010 9.56"
Initially I thought we were just logging off by one, but it seems like activation norm is different too.
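One quick way to check the off-by-one hypothesis is to line up the two ppl series at a few candidate shifts and count exact matches. The sketch below uses only the values logged above; the helper name `matches_at_shift` is made up for illustration and is not part of metaseq.

```python
# ppl by num_updates, copied from the logs above.
before = {
    45999: 9.55, 46000: 9.65, 46001: 9.64, 46002: 9.41, 46003: 9.7,
    46004: 9.4, 46005: 9.56, 46006: 9.34, 46007: 9.11, 46008: 9.44,
    46009: 9.56, 46010: 9.24,
}
after = {
    46001: 9.44, 46002: 9.64, 46003: 9.41, 46004: 9.7, 46005: 9.4,
    46006: 9.56, 46007: 9.34, 46008: 9.11, 46009: 9.31, 46010: 9.56,
}


def matches_at_shift(before, after, shift):
    """Return the steps n where after[n] equals before[n - shift]."""
    return [n for n in sorted(after) if before.get(n - shift) == after[n]]


for shift in (0, 1, 2):
    hits = matches_at_shift(before, after, shift)
    print(f"shift={shift}: {len(hits)}/{len(after)} steps match: {hits}")
```

With these numbers, a shift of 1 matches 8 of the 10 post-restart steps (everything except 46001 and 46009), while shifts of 0 and 2 match none. So the ppl values are mostly the pre-restart ones logged one step later, but not entirely.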
Full json log comparing step 46001:
(before restart)
2022-10-15 12:51:21 | INFO | train_inner | {"epoch": 16, "actv_norm": "612.901", "pos_norm": "0.634", "tok_norm": "1.301", "emb_norm": "0.003", "docsperex": "7.03", "loss": "3.269", "ppl": "9.64", "wps": "8339.1", "ups": "0", "wpb": "2.09715e+06", "bsz": "1024", "num_updates": "46001", "lr": "7.11347e-05", "gnorm": "0.133", "clip": "0", "train_wall": "7", "cuda_gb_allocated": "17.7", "cuda_gb_reserved": "54.3", "cuda_gb_free": "61.5", "wall": "0"}
(after restart)
2022-10-15 13:42:48 | INFO | train_inner | {"epoch": 16, "actv_norm": "590.863", "pos_norm": "0.634", "tok_norm": "1.301", "emb_norm": "0.003", "docsperex": "6.8", "loss": "3.239", "ppl": "9.44", "wps": "7266", "ups": "0", "wpb": "2.09715e+06", "bsz": "1024", "num_updates": "46001", "lr": "7.11347e-05", "gnorm": "0.131", "clip": "0", "train_wall": "20", "cuda_gb_allocated": "17.2", "cuda_gb_reserved": "49.1", "cuda_gb_free": "61.9", "wall": "0"}
Full json log comparing step 46002:
(before restart)
2022-10-15 12:51:28 | INFO | train_inner | {"epoch": 16, "actv_norm": "609.587", "pos_norm": "0.634", "tok_norm": "1.3", "emb_norm": "0.003", "docsperex": "7.13", "loss": "3.235", "ppl": "9.41", "wps": "293269", "ups": "0.14", "wpb": "2.09715e+06", "bsz": "1024", "num_updates": "46002", "lr": "7.11341e-05", "gnorm": "0.122", "clip": "0", "train_wall": "7", "cuda_gb_allocated": "17.7", "cuda_gb_reserved": "54.3", "cuda_gb_free": "61.5", "wall": "0"}
(after restart)
2022-10-15 13:42:55 | INFO | train_inner | {"epoch": 16, "actv_norm": "616.229", "pos_norm": "0.634", "tok_norm": "1.301", "emb_norm": "0.003", "docsperex": "7.03", "loss": "3.27", "ppl": "9.64", "wps": "287606", "ups": "0.14", "wpb": "2.09715e+06", "bsz": "1024", "num_updates": "46002", "lr": "7.11341e-05", "gnorm": "0.136", "clip": "0", "train_wall": "7", "cuda_gb_allocated": "17.7", "cuda_gb_reserved": "51.7", "cuda_gb_free": "61.5", "wall": "0"}
Full json log comparing step 46003:
(before restart)
2022-10-15 12:51:35 | INFO | train_inner | {"epoch": 16, "actv_norm": "608.311", "pos_norm": "0.634", "tok_norm": "1.301", "emb_norm": "0.003", "docsperex": "7.01", "loss": "3.278", "ppl": "9.7", "wps": "293086", "ups": "0.14", "wpb": "2.09715e+06", "bsz": "1024", "num_updates": "46003", "lr": "7.11335e-05", "gnorm": "0.139", "clip": "0", "train_wall": "7", "cuda_gb_allocated": "17.7", "cuda_gb_reserved": "54.3", "cuda_gb_free": "61.5", "wall": "0"}
(after restart)
2022-10-15 13:43:03 | INFO | train_inner | {"epoch": 16, "actv_norm": "608.412", "pos_norm": "0.634", "tok_norm": "1.3", "emb_norm": "0.003", "docsperex": "7.13", "loss": "3.235", "ppl": "9.41", "wps": "293511", "ups": "0.14", "wpb": "2.09715e+06", "bsz": "1024", "num_updates": "46003", "lr": "7.11335e-05", "gnorm": "0.129", "clip": "0", "train_wall": "7", "cuda_gb_allocated": "17.7", "cuda_gb_reserved": "51.7", "cuda_gb_free": "61.5", "wall": "0"}
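To compare entries field by field rather than by eye, something like the following could be used. It is a minimal sketch: `parse_train_inner` and `diff_fields` are hypothetical helpers, and the two log lines are abridged copies of the entries above (pre-restart step 46001 and post-restart step 46002), keeping only a few fields.

```python
import json


def parse_train_inner(line):
    """Split 'timestamp | level | tag | {json}' and decode the JSON payload."""
    _timestamp, _level, _tag, payload = line.split(" | ", 3)
    return json.loads(payload)


def diff_fields(a, b, fields):
    """Return {field: (a_value, b_value)} for the fields whose logged strings differ."""
    return {k: (a[k], b[k]) for k in fields if a[k] != b[k]}


# Abridged from the full entries above (most fields dropped for brevity).
before_46001 = (
    '2022-10-15 12:51:21 | INFO | train_inner | '
    '{"actv_norm": "612.901", "docsperex": "7.03", "loss": "3.269", '
    '"ppl": "9.64", "num_updates": "46001"}'
)
after_46002 = (
    '2022-10-15 13:42:55 | INFO | train_inner | '
    '{"actv_norm": "616.229", "docsperex": "7.03", "loss": "3.27", '
    '"ppl": "9.64", "num_updates": "46002"}'
)

a = parse_train_inner(before_46001)
b = parse_train_inner(after_46002)
print(diff_fields(a, b, ["docsperex", "ppl", "loss", "actv_norm"]))
```

For this pair, docsperex and ppl agree, while actv_norm (and the last digit of loss) do not. That is consistent with the same batch being seen one step later but with slightly different model state.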
Running off of metaseq 7828d72815a9a581ab47b95876d38cb262741883
Reading the code and thinking about what could be wrong, I found a bug in the pathway: https://github.com/facebookresearch/metaseq/pull/424
However, I don’t know whether that bug is what caused the discrepancy in this run, so I should still look at the logs and the checkpoint.
The off-by-one in the docsperex nightly suggests a problem possibly similar to https://github.com/facebookresearch/metaseq/commit/5e696d39cfb8c01cd3f502c7118fea78aac0e17e where we end up pulling from the workers in the wrong order. As long as I have the checkpoint with the token counts in it, I can quickly fast-forward the workers and try to figure out how to get the docsperex to match.
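As a toy illustration of the "wrong order" failure mode, and not a reproduction of metaseq's data loader, the sketch below round-robins over two hypothetical workers, fast-forwards each by the count it had already produced, and then resumes either from the correct worker or from the wrong one. All names here (`worker_stream`, `fast_forward`, `round_robin`, `resume`) are assumptions made up for the example.

```python
from itertools import islice


def worker_stream(worker_id, num_workers):
    """Toy worker: yields the global item indices assigned to it round-robin."""
    i = worker_id
    while True:
        yield i
        i += num_workers


def fast_forward(streams, consumed):
    """Skip, for each worker, the number of items it had already produced."""
    for stream, n in zip(streams, consumed):
        next(islice(stream, n, n), None)  # advances the generator by n items


def round_robin(streams, start=0):
    """Pull one item at a time from each worker in turn, beginning at `start`."""
    k = start
    while True:
        yield next(streams[k % len(streams)])
        k += 1


def resume(consumed, start, count, num_workers=2):
    """Rebuild the workers, fast-forward them, and pull `count` more items."""
    streams = [worker_stream(w, num_workers) for w in range(num_workers)]
    fast_forward(streams, consumed)
    return list(islice(round_robin(streams, start=start), count))


# Suppose 3 items (0, 1, 2) were consumed before the restart:
# worker 0 produced two of them, worker 1 produced one.
consumed = [2, 1]
print("correct start:", resume(consumed, start=1, count=4))
print("wrong start:  ", resume(consumed, start=0, count=4))
```

In this toy setup, resuming from the right worker continues the original sequence, while resuming from the wrong worker yields the same items in a swapped order. Whether the real run fails in this way would still need to be confirmed against the checkpoint.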