Checkpointing fails to save config data because it is a `dict` not a `str`

Describe the bug
Saving model checkpoints fails with the following stack trace:

Traceback (most recent call last):
  File "/mnt/nvme/home/dashiell/gpt-neox/train.py", line 27, in <module>
    pretrain(neox_args=neox_args)
  File "/mnt/nvme/home/dashiell/gpt-neox/megatron/training.py", line 106, in pretrain
    iteration = train(
  File "/mnt/nvme/home/dashiell/gpt-neox/megatron/training.py", line 613, in train
    save_checkpoint(
  File "/mnt/nvme/home/dashiell/gpt-neox/megatron/checkpointing.py", line 208, in save_checkpoint
    save_ds_checkpoint(iteration, model, neox_args)
  File "/mnt/nvme/home/dashiell/gpt-neox/megatron/checkpointing.py", line 201, in save_ds_checkpoint
    f.write(config_data)
TypeError: write() argument must be str, not dict

To Reproduce
python3 deepy.py --conf_dir configs 1-3B.yml local_setup.yml

Expected behavior
The checkpoint should save without failing.
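
The traceback shows `save_ds_checkpoint` handing a dict straight to `file.write()`. As a rough illustration of the kind of guard that avoids the `TypeError`, here is a minimal sketch; the JSON serialization and the `write_config` helper name are assumptions for illustration, not the patch actually applied upstream.

```python
import json

# Minimal sketch of a guard around the failing write in
# megatron/checkpointing.py's save_ds_checkpoint. The variable names
# mirror the traceback; the json round-trip and the write_config helper
# are assumptions for illustration, not the upstream fix.
def write_config(path, config_data):
    with open(path, "w") as f:
        if isinstance(config_data, dict):
            # Serialize the dict so file.write() receives a str.
            f.write(json.dumps(config_data, indent=2))
        else:
            f.write(config_data)
```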

Issue Analytics

  • State: closed
  • Created: 9 months ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

1 reaction
dashstander commented, Dec 8, 2022

I’m on DeeperSpeed and GPT-NeoX main. I actually think this is an(other) SlurmRunner issue. It’s failing specifically with the version of the config files that are saved as strings in the arguments. Those were giving me trouble when implementing the SlurmRunner, and there’s Deep(er)Speed logic to basically clean them up. They get written back to strings to be passed to srun as a command-line argument, but it’s too much of a coincidence for me not to suspect it.
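
To make that round trip concrete, here is a hypothetical sketch of the mechanism described above: config dicts get serialized to strings so they can ride along as a command-line argument to srun, then have to be parsed back into dicts on the worker side. The helper names are illustrative, not DeeperSpeed’s actual API; if that conversion gets out of sync, a dict can end up where a string is expected, which is exactly the `TypeError` in the trace above.

```python
import json

# Hypothetical illustration of the round trip described above: config
# dicts are serialized to strings so they can be passed to srun as a
# command-line argument, then parsed back into dicts on the worker side.
# These helper names are illustrative, not DeeperSpeed's actual API.
def pack_config_for_cli(config: dict) -> str:
    return json.dumps(config)

def unpack_config_from_cli(arg: str) -> dict:
    return json.loads(arg)

config = {"train_batch_size": 32, "zero_optimization": {"stage": 1}}
assert unpack_config_from_cli(pack_config_for_cli(config)) == config
```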

Maybe make another issue about loading the checkpoints? Or comment on #673? That seems like something we should all be aware of.

0 reactions
haileyschoelkopf commented, Dec 8, 2022

(Made an issue at #732 for the above error I described!)
