Checkpointing fails to save config data because it is a `dict` not a `str`
Describe the bug
Saving model checkpoints fails with the following stack trace:
Traceback (most recent call last):
  File "/mnt/nvme/home/dashiell/gpt-neox/train.py", line 27, in <module>
    pretrain(neox_args=neox_args)
  File "/mnt/nvme/home/dashiell/gpt-neox/megatron/training.py", line 106, in pretrain
    iteration = train(
  File "/mnt/nvme/home/dashiell/gpt-neox/megatron/training.py", line 613, in train
    save_checkpoint(
  File "/mnt/nvme/home/dashiell/gpt-neox/megatron/checkpointing.py", line 208, in save_checkpoint
    save_ds_checkpoint(iteration, model, neox_args)
  File "/mnt/nvme/home/dashiell/gpt-neox/megatron/checkpointing.py", line 201, in save_ds_checkpoint
    f.write(config_data)
TypeError: write() argument must be str, not dict
To Reproduce
python3 deepy.py --conf_dir configs 1-3B.yml local_setup.yml
Expected behavior
The checkpoint should save without failing.
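The traceback points at `f.write(config_data)`: Python's text-file `write()` accepts only a `str`, so passing a `dict` raises exactly this `TypeError`. A minimal sketch of the failure and the obvious fix, serializing the dict before writing (the `config_data` contents below are hypothetical, not the actual NeoX config):

```python
import json

# Hypothetical stand-in for the config data; in the failing code path it
# arrives as a dict rather than the expected str.
config_data = {"train_batch_size": 32, "lr": 1.0e-4}

with open("configs_copy.json", "w") as f:
    # f.write(config_data)  # TypeError: write() argument must be str, not dict
    f.write(json.dumps(config_data))  # serialize the dict to a JSON string first

with open("configs_copy.json") as f:
    assert json.load(f) == config_data  # round-trips cleanly
```

Guarding the write with `isinstance(config_data, str)` (serializing only when needed) would also keep the original string-input path working.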
Issue Analytics
- State:
- Created: 9 months ago
- Comments: 5 (5 by maintainers)
Top GitHub Comments
I’m on DeeperSpeed and GPT-NeoX main. I actually think this is an(other) SlurmRunner issue. It’s failing specifically with the version of the config files that are saved as strings in the arguments. Those were giving me trouble when implementing the SlurmRunner, and there’s Deep(er)Speed logic to basically clean them up. They get written back to strings to be passed to srun as a command-line arg, and it’s too much of a coincidence for me not to suspect it.

Maybe make another issue about loading the checkpoints? Or comment on #673? That seems like something we should all be aware of.
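The round-trip the comment describes can be sketched as follows. This is a hypothetical illustration of the suspected failure mode, not the actual DeeperSpeed code: configs start life as JSON strings in the launch arguments, get parsed into dicts for cleanup, and are supposed to be re-serialized before being handed to srun or written to disk.

```python
import json

# Config arrives as a JSON string in the launcher arguments (hypothetical contents).
config_str = '{"train_batch_size": 32, "lr": 0.0001}'

parsed = json.loads(config_str)   # launcher parses the string into a dict
parsed.pop("lr", None)            # ...some cleanup happens on the dict

# If the re-serialization step is skipped, downstream code that expects a
# str (e.g. f.write(config_data)) receives a dict and raises TypeError.
cleaned_str = json.dumps(parsed)  # the step that must not be skipped
assert isinstance(cleaned_str, str)
```

If the checkpointing path sometimes receives the parsed dict and sometimes the re-serialized string, that would explain why the failure only shows up with configs routed through the SlurmRunner.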
(Made an issue at #732 for the above error I described.)