
misc problems with saving the checkpoint

See original GitHub issue

Setup: 2 local GPUs, with the same default config I use everywhere.

I have several issues with saving the model:

  1. The original fp32 model gets saved as an fp16 model - how do I get back the fp32 model? The user may not continue with deepspeed and will want to share the model, so it needs to be back in fp32. Perhaps there should be a save_fp32_model method, or an option in the checkpoint?

How does it even work on resume if fp32 weights aren’t getting saved?
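
To make the ask concrete, here is a rough sketch of the kind of helper I have in mind - purely hypothetical, not an existing DeepSpeed API, and assuming ZeRO stage 2 where each rank still holds the full fp16 module; note that upcasting the fp16 weights is lossy compared to the optimizer’s master fp32 copy:

    import torch

    def save_fp32_model(engine, path):
        # engine.module is the original nn.Module wrapped by the DeepSpeed engine;
        # this merely upcasts its fp16 weights to fp32 - it does not recover the
        # master fp32 weights kept in the ZeRO optimizer partitions
        state_dict = {k: v.detach().float().cpu()
                      for k, v in engine.module.state_dict().items()}
        torch.save(state_dict, path)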

  2. I use deepspeed.save_checkpoint(output_dir)

The first checkpoint seems to get saved, judging by the filesystem, and then it hangs:

  File "/mnt/nvme1/code/github/00optimize/DeepSpeed-div-by-zero/deepspeed/runtime/engine.py", line 1456 in _checkpoint_tag_validation
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed-div-by-zero/deepspeed/runtime/engine.py", line 1489 in save_checkpoint
  File "/mnt/nvme1/code/huggingface/transformers-ds-save-model/src/transformers/trainer.py", line 1212 in _save_checkpoint
  File "/mnt/nvme1/code/huggingface/transformers-ds-save-model/src/transformers/trainer.py", line 1185 in _maybe_log_save_evaluate
  File "/mnt/nvme1/code/huggingface/transformers-ds-save-model/src/transformers/trainer.py", line 1094 in train

(Thanks to @jeffra for the tip on py-spy!)

I obviously tried to disable the check and went on to discover the undocumented config option:

    "checkpoint": {
        "tag_validation": "ignore"
    },

which I reverse-engineered. Perhaps it could be documented? (warn/ignore/fail are the 3 valid values.)

But I also didn’t use any tags…

I also tried to save only from the rank 0 process to no avail.

So when I add this, it now gets stuck in:

  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2427 in barrier
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed-div-by-zero/deepspeed/runtime/engine.py", line 1524 in _create_zero_checkpoint_files
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed-div-by-zero/deepspeed/runtime/engine.py", line 1496 in save_checkpoint
  File "/mnt/nvme1/code/huggingface/transformers-ds-save-model/src/transformers/trainer.py", line 1212 in _save_checkpoint
  File "/mnt/nvme1/code/huggingface/transformers-ds-save-model/src/transformers/trainer.py", line 1185 in _maybe_log_save_evaluate
  File "/mnt/nvme1/code/huggingface/transformers-ds-save-model/src/transformers/trainer.py", line 1094 in train

If I remove the save_checkpoint call, the program doesn’t hang and completes just fine.

If I don’t save intermediate checkpoints and save only when the training has finished, it hangs too (i.e. on that first and only call).
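
For reference, the calling pattern I’d expect to work - a minimal sketch with a stand-in model and config, not my actual setup, and assuming a DeepSpeed release where initialize() accepts the config dict directly; the barrier in the trace above suggests save_checkpoint() has to be invoked on every rank, not just on rank 0:

    import torch
    import deepspeed

    model = torch.nn.Linear(10, 10)  # stand-in model
    ds_config = {
        "train_micro_batch_size_per_gpu": 1,
        "fp16": {"enabled": True},
        "zero_optimization": {"stage": 2},
        "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
        "checkpoint": {"tag_validation": "warn"},
    }

    engine, _, _, _ = deepspeed.initialize(model=model,
                                           model_parameters=model.parameters(),
                                           config=ds_config)

    # ... training steps ...

    # every rank calls this (same output_dir, same tag); the engine synchronizes
    # internally, so a rank-0-only call leaves the other ranks stuck in a barrier
    engine.save_checkpoint("output_dir")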

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 10 (10 by maintainers)

Top GitHub Comments

1 reaction
stas00 commented, Feb 26, 2021

1 reaction
tjruwase commented, Feb 26, 2021

Got it. Thanks for the clarification. Unfortunately, there is no easy way to extract the fp32 weights from the zero_pp_rank_* files. This makes it a TODO. Could you please open an issue?
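
More recent DeepSpeed releases ship a helper for exactly this - reconstructing the fp32 weights from the zero_pp_rank_* partition files. A sketch, assuming such a release is installed (this did not exist at the time of the issue):

    import torch
    from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

    # the argument is the directory passed to save_checkpoint(), which
    # contains the zero_pp_rank_* partition files
    state_dict = get_fp32_state_dict_from_zero_checkpoint("output_dir")
    torch.save(state_dict, "pytorch_model_fp32.bin")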

