Off by something after restart?


Here’s a log of num_updates and ppl in one of the recent runs:

"45999 9.55"
"46000 9.65"
"46001 9.64"
"46002 9.41"
"46003 9.7"
"46004 9.4"
"46005 9.56"
"46006 9.34"
"46007 9.11"
"46008 9.44"
"46009 9.56"
"46010 9.24"

Then, after a restart that reloads the checkpoint from step 46000, I see:

"46310 9.46"
"46311 9.5"

(restart happens)

"46001 9.44"
"46002 9.64"
"46003 9.41"
"46004 9.7"
"46005 9.4"
"46006 9.56"
"46007 9.34"
"46008 9.11"
"46009 9.31"
"46010 9.56"

Initially I thought we were just logging off by one, but it seems like the activation norm (actv_norm) is different too.
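One way to check the off-by-one hypothesis against the numbers above is to count, for each candidate shift, how many post-restart steps report the same ppl as the pre-restart step that many updates earlier. The sketch below is only an illustration: the ppl values are copied from the two logs above, and the helper name `matches_for_shift` is hypothetical, not part of metaseq.

```python
# ppl values copied from the logs above, keyed by num_updates.
# Kept as strings so the comparison is exact (no float rounding).
before = {
    45999: "9.55", 46000: "9.65", 46001: "9.64", 46002: "9.41",
    46003: "9.7", 46004: "9.4", 46005: "9.56", 46006: "9.34",
    46007: "9.11", 46008: "9.44", 46009: "9.56", 46010: "9.24",
}
after = {
    46001: "9.44", 46002: "9.64", 46003: "9.41", 46004: "9.7",
    46005: "9.4", 46006: "9.56", 46007: "9.34", 46008: "9.11",
    46009: "9.31", 46010: "9.56",
}


def matches_for_shift(before, after, shift):
    """Return the post-restart steps whose ppl equals the pre-restart ppl
    logged `shift` updates earlier (hypothetical helper)."""
    return [
        step
        for step, ppl in sorted(after.items())
        if before.get(step - shift) == ppl
    ]


for shift in (-1, 0, 1, 2):
    hits = matches_for_shift(before, after, shift)
    print(f"shift={shift:+d}: {len(hits)}/{len(after)} steps match")
```

On these numbers a shift of one aligns 8 of the 10 post-restart steps (all but 46001 and 46009), while shifts of -1, 0, and 2 align none. Since ppl is rounded to two decimals, a match is suggestive rather than proof that the same batch was seen.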


Full JSON log comparing step 46001:

(before restart)
2022-10-15 12:51:21 | INFO | train_inner | {"epoch": 16, "actv_norm": "612.901", "pos_norm": "0.634", "tok_norm": "1.301", "emb_norm": "0.003", "docsperex": "7.03", "loss": "3.269", "ppl": "9.64", "wps": "8339.1", "ups": "0", "wpb": "2.09715e+06", "bsz": "1024", "num_updates": "46001", "lr": "7.11347e-05", "gnorm": "0.133", "clip": "0", "train_wall": "7", "cuda_gb_allocated": "17.7", "cuda_gb_reserved": "54.3", "cuda_gb_free": "61.5", "wall": "0"}
(after restart)
2022-10-15 13:42:48 | INFO | train_inner | {"epoch": 16, "actv_norm": "590.863", "pos_norm": "0.634", "tok_norm": "1.301", "emb_norm": "0.003", "docsperex": "6.8", "loss": "3.239", "ppl": "9.44", "wps": "7266", "ups": "0", "wpb": "2.09715e+06", "bsz": "1024", "num_updates": "46001", "lr": "7.11347e-05", "gnorm": "0.131", "clip": "0", "train_wall": "20", "cuda_gb_allocated": "17.2", "cuda_gb_reserved": "49.1", "cuda_gb_free": "61.9", "wall": "0"}

Full JSON log comparing step 46002:

(before restart)
2022-10-15 12:51:28 | INFO | train_inner | {"epoch": 16, "actv_norm": "609.587", "pos_norm": "0.634", "tok_norm": "1.3", "emb_norm": "0.003", "docsperex": "7.13", "loss": "3.235", "ppl": "9.41", "wps": "293269", "ups": "0.14", "wpb": "2.09715e+06", "bsz": "1024", "num_updates": "46002", "lr": "7.11341e-05", "gnorm": "0.122", "clip": "0", "train_wall": "7", "cuda_gb_allocated": "17.7", "cuda_gb_reserved": "54.3", "cuda_gb_free": "61.5", "wall": "0"}
(after restart)
2022-10-15 13:42:55 | INFO | train_inner | {"epoch": 16, "actv_norm": "616.229", "pos_norm": "0.634", "tok_norm": "1.301", "emb_norm": "0.003", "docsperex": "7.03", "loss": "3.27", "ppl": "9.64", "wps": "287606", "ups": "0.14", "wpb": "2.09715e+06", "bsz": "1024", "num_updates": "46002", "lr": "7.11341e-05", "gnorm": "0.136", "clip": "0", "train_wall": "7", "cuda_gb_allocated": "17.7", "cuda_gb_reserved": "51.7", "cuda_gb_free": "61.5", "wall": "0"}

Full JSON log comparing step 46003:

(before restart)
2022-10-15 12:51:35 | INFO | train_inner | {"epoch": 16, "actv_norm": "608.311", "pos_norm": "0.634", "tok_norm": "1.301", "emb_norm": "0.003", "docsperex": "7.01", "loss": "3.278", "ppl": "9.7", "wps": "293086", "ups": "0.14", "wpb": "2.09715e+06", "bsz": "1024", "num_updates": "46003", "lr": "7.11335e-05", "gnorm": "0.139", "clip": "0", "train_wall": "7", "cuda_gb_allocated": "17.7", "cuda_gb_reserved": "54.3", "cuda_gb_free": "61.5", "wall": "0"}
(after restart)
2022-10-15 13:43:03 | INFO | train_inner | {"epoch": 16, "actv_norm": "608.412", "pos_norm": "0.634", "tok_norm": "1.3", "emb_norm": "0.003", "docsperex": "7.13", "loss": "3.235", "ppl": "9.41", "wps": "293511", "ups": "0.14", "wpb": "2.09715e+06", "bsz": "1024", "num_updates": "46003", "lr": "7.11335e-05", "gnorm": "0.129", "clip": "0", "train_wall": "7", "cuda_gb_allocated": "17.7", "cuda_gb_reserved": "51.7", "cuda_gb_free": "61.5", "wall": "0"}
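To see at a glance which fields change between two `train_inner` lines, the JSON payload can be parsed and compared field by field. This is a minimal sketch that assumes the `timestamp | level | logger | {json}` layout shown above. The helper names are hypothetical, and the two sample lines are the step 46001 entries abbreviated to a subset of their fields.

```python
import json


def parse_train_inner(line):
    """Split a 'timestamp | level | logger | {json}' line and decode the payload."""
    _timestamp, _level, _logger, payload = line.split(" | ", 3)
    return json.loads(payload)


def differing_fields(a, b):
    """Return {field: (a_value, b_value)} for fields in both dicts whose values differ."""
    return {k: (a[k], b[k]) for k in a if k in b and a[k] != b[k]}


# Step 46001 lines from the logs above, abbreviated to a subset of fields.
before_line = (
    '2022-10-15 12:51:21 | INFO | train_inner | {"epoch": 16, '
    '"actv_norm": "612.901", "pos_norm": "0.634", "docsperex": "7.03", '
    '"loss": "3.269", "ppl": "9.64", "num_updates": "46001", '
    '"lr": "7.11347e-05", "gnorm": "0.133"}'
)
after_line = (
    '2022-10-15 13:42:48 | INFO | train_inner | {"epoch": 16, '
    '"actv_norm": "590.863", "pos_norm": "0.634", "docsperex": "6.8", '
    '"loss": "3.239", "ppl": "9.44", "num_updates": "46001", '
    '"lr": "7.11347e-05", "gnorm": "0.131"}'
)

diff = differing_fields(parse_train_inner(before_line), parse_train_inner(after_line))
for field, (old, new) in diff.items():
    print(f"{field}: {old} -> {new}")
```

For step 46001 the learning rate and num_updates agree, while actv_norm, docsperex, loss, ppl, and gnorm differ. That pattern is consistent with the optimizer state being restored correctly but a different batch being consumed.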

Running metaseq at commit 7828d72815a9a581ab47b95876d38cb262741883

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

1 reaction
zdevito commented, Oct 16, 2022

Reading the code and thinking about what could be wrong, I found a bug in the pathway: https://github.com/facebookresearch/metaseq/pull/424

However, I don’t know whether that bug is what caused the problem in this run, so I should still look at the logs and checkpoint.

1 reaction
zdevito commented, Oct 16, 2022

The off-by-one in the docsperex nightly suggests a possible similar problem to https://github.com/facebookresearch/metaseq/commit/5e696d39cfb8c01cd3f502c7118fea78aac0e17e where we end up pulling from the workers in the wrong order. As long as I have the checkpoint with the token counts in it, I can quickly fast-forward the workers and try to figure out how to get the docsperex to match.
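As a toy illustration of how pulling from workers in the wrong order can change the batch sequence after a restart, consider workers that are consumed round-robin. This is not metaseq's data loading code: the `worker` and `consume` functions, the batch numbering, and the `start_worker` parameter are all assumptions made for the sketch. It also does not claim to reproduce the exact pattern in the logs above.

```python
from itertools import count


def worker(worker_id, num_workers, skip=0):
    """Toy worker: yields batch ids worker_id, worker_id + num_workers, ...
    after skipping the first `skip` of them (the fast-forward)."""
    for i in count(skip):
        yield worker_id + i * num_workers


def consume(num_workers, n, consumed=0, start_worker=None):
    """Pull n batches round-robin, resuming after `consumed` batches were
    already taken before the checkpoint."""
    # Number of batches each worker had already produced at checkpoint time.
    skips = [
        consumed // num_workers + (1 if w < consumed % num_workers else 0)
        for w in range(num_workers)
    ]
    workers = [worker(w, num_workers, skips[w]) for w in range(num_workers)]
    if start_worker is None:
        # Correct resume: continue with the worker whose turn is next.
        start_worker = consumed % num_workers
    return [next(workers[(start_worker + i) % num_workers]) for i in range(n)]


uninterrupted = consume(num_workers=2, n=7)
resumed_ok = consume(num_workers=2, n=4, consumed=3)
resumed_wrong = consume(num_workers=2, n=4, consumed=3, start_worker=0)

print(uninterrupted)  # batches in the original order
print(resumed_ok)     # continues where the first run left off
print(resumed_wrong)  # same batches, different order
```

In this toy setup every worker is fast-forwarded correctly in both resumed runs. The only difference is which worker is pulled from first, and that alone is enough to make the per-step batches (and hence per-step metrics such as docsperex) diverge from the uninterrupted run.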
