Checkpointing fails to save config data because it is a `dict` not a `str`
Describe the bug
Saving model checkpoints fails with the following stack trace:
Traceback (most recent call last):
  File "/mnt/nvme/home/dashiell/gpt-neox/train.py", line 27, in <module>
    pretrain(neox_args=neox_args)
  File "/mnt/nvme/home/dashiell/gpt-neox/megatron/training.py", line 106, in pretrain
    iteration = train(
  File "/mnt/nvme/home/dashiell/gpt-neox/megatron/training.py", line 613, in train
    save_checkpoint(
  File "/mnt/nvme/home/dashiell/gpt-neox/megatron/checkpointing.py", line 208, in save_checkpoint
    save_ds_checkpoint(iteration, model, neox_args)
  File "/mnt/nvme/home/dashiell/gpt-neox/megatron/checkpointing.py", line 201, in save_ds_checkpoint
    f.write(config_data)
TypeError: write() argument must be str, not dict
To Reproduce
python3 deepy.py --conf_dir configs 1-3B.yml local_setup.yml
Expected behavior
The checkpoint should save without failing.
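The traceback points at `f.write(config_data)`: Python's text-file `write()` accepts only a `str`, so passing a `dict` raises exactly this `TypeError`. A minimal sketch of the failure and the obvious fix, serializing the dict before writing (the `config_data` contents below are hypothetical, not the actual NeoX config):

```python
import json

# Hypothetical stand-in for the config data; in the failing code path it
# arrives as a dict rather than the expected str.
config_data = {"train_batch_size": 32, "lr": 1.0e-4}

with open("configs_copy.json", "w") as f:
    # f.write(config_data)  # TypeError: write() argument must be str, not dict
    f.write(json.dumps(config_data))  # serialize the dict to a JSON string first

with open("configs_copy.json") as f:
    assert json.load(f) == config_data  # round-trips cleanly
```

Guarding the write with `isinstance(config_data, str)` (serializing only when needed) would also keep the original string-input path working.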
Issue Analytics
- State:
- Created: 9 months ago
- Comments: 5 (5 by maintainers)
Top GitHub Comments
I’m on DeeperSpeed and GPT-NeoX main. I actually think this is an(other) SlurmRunner issue. It’s failing specifically with the version of the config files that are saved as strings in the arguments. Those were giving me trouble when implementing the SlurmRunner, and there’s Deep(er)Speed logic to basically clean them up. They get written back to strings to be passed to srun as a command-line arg, and it’s too much of a coincidence for me not to suspect it.

Maybe make another issue about loading the checkpoints? Or comment on #673? That seems like something we should all be aware of.
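The round-trip the comment describes can be sketched as follows. This is a hypothetical illustration of the suspected failure mode, not the actual DeeperSpeed code: configs start life as JSON strings in the launch arguments, get parsed into dicts for cleanup, and are supposed to be re-serialized before being handed to srun or written to disk.

```python
import json

# Config arrives as a JSON string in the launcher arguments (hypothetical contents).
config_str = '{"train_batch_size": 32, "lr": 0.0001}'

parsed = json.loads(config_str)   # launcher parses the string into a dict
parsed.pop("lr", None)            # ...some cleanup happens on the dict

# If the re-serialization step is skipped, downstream code that expects a
# str (e.g. f.write(config_data)) receives a dict and raises TypeError.
cleaned_str = json.dumps(parsed)  # the step that must not be skipped
assert isinstance(cleaned_str, str)
```

If the checkpointing path sometimes receives the parsed dict and sometimes the re-serialized string, that would explain why the failure only shows up with configs routed through the SlurmRunner.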
(Made an issue at #732 for the above error I described.)