Not being able to save T5-11B checkpoint using deepspeed
Describe the bug
Saving a T5-11B checkpoint with DeepSpeed fails at the end of training.
To Reproduce
Steps to reproduce the behavior:
export BS=12; PYTHONPATH=../../../src USE_TF=0 deepspeed --num_gpus=4 ./run_translation.py \
--model_name_or_path /local/nlp/temp/poetryT511B0/checkpoint-801 \
--output_dir /local/nlp/temp/poetryT511B1 \
--evaluation_strategy=steps \
--save_strategy=epoch \
--eval_steps 200 \
--save_steps 200 \
--do_train \
--do_eval \
--train_file /home/tuhin.chakr/gpt3/poetrynew/train.json \
--validation_file /home/tuhin.chakr/gpt3/poetrynew/val.json \
--learning_rate 1e-3 \
--gradient_accumulation_steps 21 \
--overwrite_output_dir \
--max_source_length 64 \
--max_target_length 64 \
--num_train_epochs 1 \
--per_device_train_batch_size $BS \
--per_device_eval_batch_size $BS \
--source_lang en_XX \
--target_lang en_XX \
--deepspeed /home/tuhin.chakr/gpt3/transformers/tests/deepspeed/ds_config_zero3_1.json
Expected behavior
The checkpoint is saved after training.
ds_report output
[INFO|trainer.py:2250] 2022-01-10 09:42:18,771 >> Num examples = 65394
[INFO|trainer.py:2253] 2022-01-10 09:42:18,771 >> Batch size = 12
{'eval_loss': 1.306259274482727, 'eval_runtime': 1585.6465, 'eval_samples_per_second': 41.241, 'eval_steps_per_second': 0.86, 'epoch': 1.0}
100%|██████████| 801/801 [20:22:08<00:00, 559.84s/it]
[INFO|trainer.py:2003] 2022-01-10 10:10:10,357 >> Saving model checkpoint to /local/nlp/temp/poetryT511B1/checkpoint-801
[INFO|configuration_utils.py:423] 2022-01-10 10:10:10,358 >> Configuration saved in /local/nlp/temp/poetryT511B1/checkpoint-801/config.json
[INFO|modeling_utils.py:1070] 2022-01-10 10:10:10,516 >> Model weights saved in /local/nlp/temp/poetryT511B1/checkpoint-801/pytorch_model.bin
[INFO|tokenization_utils_base.py:2043] 2022-01-10 10:10:10,517 >> tokenizer config file saved in /local/nlp/temp/poetryT511B1/checkpoint-801/tokenizer_config.json
[INFO|tokenization_utils_base.py:2049] 2022-01-10 10:10:10,517 >> Special tokens file saved in /local/nlp/temp/poetryT511B1/checkpoint-801/special_tokens_map.json
[INFO|tokenization_t5_fast.py:159] 2022-01-10 10:10:10,566 >> Copy vocab file to /local/nlp/temp/poetryT511B1/checkpoint-801/spiece.model
Traceback (most recent call last):
  File "./run_translation.py", line 626, in <module>
    main()
  File "./run_translation.py", line 543, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/transformers/trainer.py", line 1399, in train
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/transformers/trainer.py", line 1503, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/transformers/trainer.py", line 1565, in _save_checkpoint
    self.save_model(output_dir)
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/transformers/trainer.py", line 1966, in save_model
    self.deepspeed.save_fp16_model(output_dir, WEIGHTS_NAME)
  File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/engine.py", line 3024, in save_fp16_model
    state_dict = self._zero3_consolidated_fp16_state_dict()
  File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/engine.py", line 2999, in _zero3_consolidated_fp16_state_dict
    get_layer_state_dict(self.module, prefix="")
  File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/engine.py", line 2996, in get_layer_state_dict
    get_layer_state_dict(child, prefix + name + ".")
  File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/engine.py", line 2996, in get_layer_state_dict
    get_layer_state_dict(child, prefix + name + ".")
  File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/engine.py", line 2996, in get_layer_state_dict
    get_layer_state_dict(child, prefix + name + ".")
  [Previous line repeated 4 more times]
  File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/engine.py", line 2991, in get_layer_state_dict
    state_dict[prefix + name] = buf.detach().cpu()
  File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/zero/partition_parameters.py", line 1327, in __exit__
    self.params[0].partition(param_list=self.params, has_been_updated=True)
  File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/zero/partition_parameters.py", line 604, in partition
    self._partition(param_list, has_been_updated=has_been_updated)
  File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/zero/partition_parameters.py", line 716, in _partition
    self._partition_param(param, has_been_updated=has_been_updated)
  File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/zero/partition_parameters.py", line 725, in _partition_param
    assert param.ds_status is not ZeroParamStatus.INFLIGHT, f" {param} Cannot partition a param in flight"
AssertionError: Parameter containing:
tensor([[-0.0184, 0.0311, 0.0164, ..., 0.0964, 0.0053, 0.0294],
[ 0.0060, -0.0118, 0.0124, ..., -0.0006, 0.0004, 0.0281],
[-0.0068, 0.0219, -0.0637, ..., 0.0357, 0.0150, 0.0212],
...,
[ 0.0526, -0.0020, 0.0183, ..., 0.0039, 0.0156, 0.0289],
[ 0.0212, -0.0099, -0.0158, ..., 0.0561, 0.0485, 0.0107],
[ 0.0658, 0.0129, 0.1380, ..., -0.0192, -0.0014, 0.0330]],
device='cuda:2', requires_grad=True) Cannot partition a param in flight
[The same AssertionError, with identical tensor values, is raised on device='cuda:1', device='cuda:3' and device='cuda:0'.]
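To make the failing assertion easier to follow, here is a minimal toy sketch in plain Python. It is not DeepSpeed code: `ParamStatus`, `ToyParam`, `ToyGather` and `try_save` are hypothetical names invented for illustration. It only models the rule visible in the traceback, namely that the context manager's `__exit__` re-partitions the parameter and refuses to do so while the parameter is still marked as in flight. The issue text does not state the actual root cause, so the "left in flight" scenario below is an assumption.

```python
from enum import Enum, auto


class ParamStatus(Enum):
    """Hypothetical stand-in for a partitioned parameter's lifecycle state."""
    NOT_AVAILABLE = auto()  # only the local shard is resident
    INFLIGHT = auto()       # a gather was launched but has not completed
    AVAILABLE = auto()      # the full parameter is resident


class ToyParam:
    def __init__(self, name):
        self.name = name
        self.status = ParamStatus.NOT_AVAILABLE

    def start_gather(self):
        self.status = ParamStatus.INFLIGHT

    def finish_gather(self):
        self.status = ParamStatus.AVAILABLE

    def partition(self):
        # Same rule as the assertion in the traceback: a parameter whose
        # gather is still pending must not be released.
        assert self.status is not ParamStatus.INFLIGHT, (
            f"{self.name}: cannot partition a param in flight"
        )
        self.status = ParamStatus.NOT_AVAILABLE


class ToyGather:
    """Gathers the parameter on entry and re-partitions it on exit."""

    def __init__(self, param):
        self.param = param

    def __enter__(self):
        if self.param.status is ParamStatus.NOT_AVAILABLE:
            self.param.start_gather()
            self.param.finish_gather()
        return self.param

    def __exit__(self, exc_type, exc, tb):
        self.param.partition()
        return False


def try_save(param):
    """Return 'saved' or a failure message instead of raising."""
    try:
        with ToyGather(param):
            pass  # a real save would copy the gathered tensor to CPU here
        return "saved"
    except AssertionError as err:
        return f"failed: {err}"


clean = ToyParam("clean")
stuck = ToyParam("stuck")
stuck.start_gather()  # assumed scenario: an earlier gather never completed

print(try_save(clean))
print(try_save(stuck))
```

In this toy model the first parameter is gathered and released normally, while the second one enters the save path already in flight and trips the assertion on exit, which is the same shape as the failure reported above.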
Issue Analytics
- State:
- Created 2 years ago
- Comments:36 (30 by maintainers)
Top GitHub Comments
I appreciate you re-validating the fix, @aphedges!
@tjruwase made it much easier to do so, as he tirelessly makes the DeepSpeed codebase easier to adjust, so I had changed its final version.
re: VCS
I just have a clone, which is even faster to git pull. There are many ways to install a module.
Today I tried out the revised fix that made it into https://github.com/microsoft/DeepSpeed/commit/baef92e26fef5aa0da63f26d444b91c2a7aa0bd3 on my full script, and it worked properly! Thank you very much for your work on fixing this issue!
I am aware of the VCS installation in pip and it's what I usually use, but it seems to be slightly faster to let GitHub zip it first, at least with n=1. I guess it also helps if one doesn't have Git on their system for some reason.
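For readers comparing the two install routes mentioned here, the following sketch builds both pip targets for the commit linked above. `pip_targets` is a hypothetical helper written for this illustration; the URL shapes are pip's `git+https` VCS form and GitHub's `/archive/<ref>.zip` download form. It only prints the commands and does not run them.

```python
def pip_targets(owner, repo, ref):
    """Build pip install targets for a GitHub repository at a given ref."""
    base = f"https://github.com/{owner}/{repo}"
    return {
        # VCS install: pip clones the repository, so Git must be installed.
        "vcs": f"git+{base}.git@{ref}",
        # Archive install: GitHub serves a zip of the ref, no Git needed.
        "zip": f"{base}/archive/{ref}.zip",
    }


targets = pip_targets(
    "microsoft", "DeepSpeed", "baef92e26fef5aa0da63f26d444b91c2a7aa0bd3"
)
for kind, target in targets.items():
    print(f"{kind}: pip install {target}")
```

A local clone updated with git pull, as described in the reply, is a third option and suits anyone who expects to edit the source.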