Not being able to save T5-11B checkpoint using deepspeed

Describe the bug

Not able to save a T5-11B checkpoint when training with DeepSpeed.

To Reproduce

Steps to reproduce the behavior:

export BS=12
PYTHONPATH=../../../src USE_TF=0 deepspeed --num_gpus=4 ./run_translation.py \
        --model_name_or_path /local/nlp/temp/poetryT511B0/checkpoint-801 \
        --output_dir /local/nlp/temp/poetryT511B1 \
        --evaluation_strategy=steps \
        --save_strategy=epoch \
        --eval_steps 200 \
        --save_steps 200 \
        --do_train \
        --do_eval \
        --train_file /home/tuhin.chakr/gpt3/poetrynew/train.json \
        --validation_file /home/tuhin.chakr/gpt3/poetrynew/val.json \
        --learning_rate 1e-3 \
        --gradient_accumulation_steps 21 \
        --overwrite_output_dir \
        --max_source_length 64 \
        --max_target_length 64 \
        --num_train_epochs 1 \
        --per_device_train_batch_size $BS \
        --per_device_eval_batch_size $BS \
        --source_lang en_XX \
        --target_lang en_XX \
        --deepspeed /home/tuhin.chakr/gpt3/transformers/tests/deepspeed/ds_config_zero3_1.json
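
For orientation, the batch-related flags in the command above can be combined into the number of samples consumed per optimizer step. This is a minimal arithmetic sketch that assumes plain data parallelism across the 4 GPUs; the helper name `effective_batch_size` is made up for illustration and is not part of Transformers or DeepSpeed.

```python
# Values copied from the command above.
per_device_train_batch_size = 12   # $BS
num_gpus = 4                       # --num_gpus=4
gradient_accumulation_steps = 21   # --gradient_accumulation_steps 21


def effective_batch_size(per_device: int, gpus: int, accum_steps: int) -> int:
    """Samples consumed per optimizer step, assuming pure data parallelism."""
    return per_device * gpus * accum_steps


total = effective_batch_size(
    per_device_train_batch_size, num_gpus, gradient_accumulation_steps
)
print(total)  # 12 * 4 * 21 = 1008
```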

Expected behavior

The checkpoint is saved after training.

ds_report output

[INFO|trainer.py:2250] 2022-01-10 09:42:18,771 >>   Num examples = 65394
[INFO|trainer.py:2253] 2022-01-10 09:42:18,771 >>   Batch size = 12
{'eval_loss': 1.306259274482727, 'eval_runtime': 1585.6465, 'eval_samples_per_second': 41.241, 'eval_steps_per_second': 0.86, 'epoch': 1.0}
100%|██████████| 801/801 [20:22:08<00:00, 559.84s/it]
[INFO|trainer.py:2003] 2022-01-10 10:10:10,357 >> Saving model checkpoint to /local/nlp/temp/poetryT511B1/checkpoint-801
[INFO|configuration_utils.py:423] 2022-01-10 10:10:10,358 >> Configuration saved in /local/nlp/temp/poetryT511B1/checkpoint-801/config.json
[INFO|modeling_utils.py:1070] 2022-01-10 10:10:10,516 >> Model weights saved in /local/nlp/temp/poetryT511B1/checkpoint-801/pytorch_model.bin
[INFO|tokenization_utils_base.py:2043] 2022-01-10 10:10:10,517 >> tokenizer config file saved in /local/nlp/temp/poetryT511B1/checkpoint-801/tokenizer_config.json
[INFO|tokenization_utils_base.py:2049] 2022-01-10 10:10:10,517 >> Special tokens file saved in /local/nlp/temp/poetryT511B1/checkpoint-801/special_tokens_map.json
[INFO|tokenization_t5_fast.py:159] 2022-01-10 10:10:10,566 >> Copy vocab file to /local/nlp/temp/poetryT511B1/checkpoint-801/spiece.model
All four ranks (cuda:0 through cuda:3) raised the same AssertionError with the following traceback:

Traceback (most recent call last):
  File "./run_translation.py", line 626, in <module>
    main()
  File "./run_translation.py", line 543, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/transformers/trainer.py", line 1399, in train
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/transformers/trainer.py", line 1503, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/transformers/trainer.py", line 1565, in _save_checkpoint
    self.save_model(output_dir)
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/transformers/trainer.py", line 1966, in save_model
    self.deepspeed.save_fp16_model(output_dir, WEIGHTS_NAME)
  File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/engine.py", line 3024, in save_fp16_model
    state_dict = self._zero3_consolidated_fp16_state_dict()
  File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/engine.py", line 2999, in _zero3_consolidated_fp16_state_dict
    get_layer_state_dict(self.module, prefix="")
  File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/engine.py", line 2996, in get_layer_state_dict
    get_layer_state_dict(child, prefix + name + ".")
  File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/engine.py", line 2996, in get_layer_state_dict
    get_layer_state_dict(child, prefix + name + ".")
  File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/engine.py", line 2996, in get_layer_state_dict
    get_layer_state_dict(child, prefix + name + ".")
  [Previous line repeated 4 more times]
  File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/engine.py", line 2991, in get_layer_state_dict
    state_dict[prefix + name] = buf.detach().cpu()
  File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/zero/partition_parameters.py", line 1327, in __exit__
    self.params[0].partition(param_list=self.params, has_been_updated=True)
  File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/zero/partition_parameters.py", line 604, in partition
    self._partition(param_list, has_been_updated=has_been_updated)
  File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/zero/partition_parameters.py", line 716, in _partition
    self._partition_param(param, has_been_updated=has_been_updated)
  File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/zero/partition_parameters.py", line 725, in _partition_param
    assert param.ds_status is not ZeroParamStatus.INFLIGHT, f" {param} Cannot partition a param in flight"
AssertionError:  Parameter containing:
tensor([[-0.0184,  0.0311,  0.0164,  ...,  0.0964,  0.0053,  0.0294],
        [ 0.0060, -0.0118,  0.0124,  ..., -0.0006,  0.0004,  0.0281],
        [-0.0068,  0.0219, -0.0637,  ...,  0.0357,  0.0150,  0.0212],
        ...,
        [ 0.0526, -0.0020,  0.0183,  ...,  0.0039,  0.0156,  0.0289],
        [ 0.0212, -0.0099, -0.0158,  ...,  0.0561,  0.0485,  0.0107],
        [ 0.0658,  0.0129,  0.1380,  ..., -0.0192, -0.0014,  0.0330]],
       device='cuda:0', requires_grad=True) Cannot partition a param in flight
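
The last frame of the traceback is a status check: partitioning refuses to run while a parameter is marked INFLIGHT. The sketch below is a toy model of that kind of guard, not DeepSpeed's implementation. Only the INFLIGHT state and the error message come from the traceback; the other state names, the `ToyParam` class, its methods, and the parameter name `shared.weight` are assumptions made up for illustration.

```python
from enum import Enum


class ParamStatus(Enum):
    # Only INFLIGHT appears in the traceback; the other two states are assumed.
    NOT_AVAILABLE = 1  # parameter is partitioned across ranks
    INFLIGHT = 2       # a gather has been started but not completed
    AVAILABLE = 3      # the full parameter is materialized locally


class ToyParam:
    """Hypothetical stand-in for a partitioned parameter."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.status = ParamStatus.NOT_AVAILABLE

    def start_gather(self) -> None:
        self.status = ParamStatus.INFLIGHT

    def finish_gather(self) -> None:
        self.status = ParamStatus.AVAILABLE

    def partition(self) -> None:
        # Same shape of guard as the assert in the final traceback frame.
        assert self.status is not ParamStatus.INFLIGHT, (
            f"{self.name}: Cannot partition a param in flight"
        )
        self.status = ParamStatus.NOT_AVAILABLE


def try_partition(param: ToyParam) -> str:
    """Return 'ok' on success, or the assertion message on failure."""
    try:
        param.partition()
        return "ok"
    except AssertionError as err:
        return str(err)


p = ToyParam("shared.weight")       # hypothetical parameter name
p.start_gather()
result_in_flight = try_partition(p)   # gather still pending: guard fires
p.finish_gather()
result_after_wait = try_partition(p)  # gather completed: partition succeeds
print(result_in_flight)
print(result_after_wait)
```

In this toy model the guard fires only when `partition()` is called between `start_gather()` and `finish_gather()`. Whether that ordering is what happened in the run above is not established by the log.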

Issue Analytics

Created 2 years ago
Comments: 36 (30 by maintainers)

Top GitHub Comments

stas00 commented, Feb 16, 2022

I appreciate you re-validating the fix, @aphedges!

@tjruwase made it much easier to do so as he tirelessly makes the deepspeed codebase easier to adjust so I had changed its final version.

re: VCS

I just have a clone, which is even faster to git pull 😃

There are many ways to install a module 😉

aphedges commented, Feb 16, 2022

Today I tried out the revised fix that made it into https://github.com/microsoft/DeepSpeed/commit/baef92e26fef5aa0da63f26d444b91c2a7aa0bd3 on my full script, and it worked properly! Thank you very much for your work on fixing this issue!

I am aware of the VCS installation in pip and it’s what I usually use, but it seems to be slightly faster to let GitHub zip it first, at least with n=1. I guess it also helps if one doesn’t have Git on their system for some reason.
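
The two install routes mentioned in this comment can be written out for the commit linked above. This is a sketch: the helper names are made up for illustration, the repository URL and commit hash are taken from the link in the comment, and the two URL shapes follow pip's `git+` VCS syntax and GitHub's per-commit archive path.

```python
REPO = "https://github.com/microsoft/DeepSpeed"
COMMIT = "baef92e26fef5aa0da63f26d444b91c2a7aa0bd3"


def vcs_requirement(repo: str, commit: str) -> str:
    """pip VCS form: pip clones the repository with Git and checks out the commit."""
    return f"git+{repo}.git@{commit}"


def zip_requirement(repo: str, commit: str) -> str:
    """Archive form: pip downloads a zip that GitHub builds for the commit (no Git needed)."""
    return f"{repo}/archive/{commit}.zip"


for requirement in (vcs_requirement(REPO, COMMIT), zip_requirement(REPO, COMMIT)):
    print(f"pip install {requirement}")
```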
