Error while saving T5-11B checkpoint
Getting this error, which I honestly don't understand:
[INFO|trainer.py:1995] 2021-11-19 01:06:24,979 >> Saving model checkpoint to /local/nlp/temp/poetryT5-11B_new/checkpoint-21
[INFO|configuration_utils.py:417] 2021-11-19 01:06:24,980 >> Configuration saved in /local/nlp/temp/poetryT5-11B_new/checkpoint-21/config.json
[INFO|modeling_utils.py:1058] 2021-11-19 01:07:05,343 >> Model weights saved in /local/nlp/temp/poetryT5-11B_new/checkpoint-21/pytorch_model.bin
[INFO|tokenization_utils_base.py:2034] 2021-11-19 01:07:05,345 >> tokenizer config file saved in /local/nlp/temp/poetryT5-11B_new/checkpoint-21/tokenizer_config.json
[INFO|tokenization_utils_base.py:2040] 2021-11-19 01:07:05,345 >> Special tokens file saved in /local/nlp/temp/poetryT5-11B_new/checkpoint-21/special_tokens_map.json
[INFO|tokenization_t5_fast.py:159] 2021-11-19 01:07:05,380 >> Copy vocab file to /local/nlp/temp/poetryT5-11B_new/checkpoint-21/spiece.model
[2021-11-19 01:07:05,399] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: /local/nlp/temp/poetryT5-11B_new/checkpoint-21/global_step21/mp_rank_00_model_states.pt
Each of the three training processes printed the same traceback (one copy shown, interleaved duplicates collapsed):

Traceback (most recent call last):
  File "./finetune_trainer.py", line 368, in <module>
    main()
  File "./finetune_trainer.py", line 305, in main
    train_result = trainer.train(
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/transformers/trainer.py", line 1391, in train
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/transformers/trainer.py", line 1495, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/transformers/trainer.py", line 1561, in _save_checkpoint
    self.deepspeed.save_checkpoint(output_dir)
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 2304, in save_checkpoint
    self._save_zero_checkpoint(save_dir, tag)
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 2556, in _save_zero_checkpoint
    zero_sd = dict(optimizer_state_dict=self.optimizer.state_dict(),
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/deepspeed/runtime/zero/stage2.py", line 1962, in state_dict
    state_dict['base_optimizer_state'] = self._get_base_optimizer_state()
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/deepspeed/runtime/zero/stage2.py", line 1940, in _get_base_optimizer_state
    lean_optimizer_state = self._get_state_without_padding(
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/deepspeed/runtime/zero/stage2.py", line 1928, in _get_state_without_padding
    lean_state[key] = value[:lean_length]
IndexError: slice() cannot be applied to a 0-dim tensor.
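For context, the crash happens because DeepSpeed's ZeRO checkpointing slices each optimizer state tensor down to its unpadded length, and a 0-dim (scalar) state entry cannot be sliced. Below is a minimal sketch of that failure mode and a scalar-aware guard, using NumPy arrays to stand in for torch tensors; the function name mirrors DeepSpeed's internal helper but this is an illustrative reconstruction, not the library's actual code.

```python
import numpy as np

def get_state_without_padding(state, lean_length):
    """Trim each padded 1-D optimizer state down to lean_length.

    The failing DeepSpeed logic slices every value unconditionally;
    a 0-dim entry (e.g. a scalar 'step' counter stored as a tensor)
    then raises IndexError. Passing scalars through avoids that.
    """
    lean_state = {}
    for key, value in state.items():
        if np.ndim(value) == 0:
            # scalar state: slicing a 0-dim array raises IndexError
            lean_state[key] = value
        else:
            lean_state[key] = value[:lean_length]
    return lean_state

# Reproduce the original failure: slicing a 0-dim array is an error.
try:
    np.array(21)[:8]
except IndexError as exc:
    print("IndexError:", exc)

state = {
    "exp_avg": np.arange(10.0),  # padded 1-D state, trimmed to 8
    "step": np.array(21),        # 0-dim entry the original code crashed on
}
trimmed = get_state_without_padding(state, 8)
print(len(trimmed["exp_avg"]), int(trimmed["step"]))
```

This matches the shape of the eventual fix direction discussed for DeepSpeed: treat scalar optimizer state specially instead of slicing it.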
Issue Analytics
- State:
- Created: 2 years ago
- Comments: 9 (4 by maintainers)
Top GitHub Comments
Hi @tuhinjubcse, I see you've been working with the excellent @stas00 on some of these issues. I finished reading up on the latest with you two in this issue https://github.com/huggingface/transformers/issues/14531.
As Stas mentioned, once this DeepSpeed PR https://github.com/microsoft/DeepSpeed/pull/1453 is merged you should be able to run ZeRO stage 3 with BF16 support, which should help reduce memory and potentially improve throughput. If you want to give it a try before it's merged you can check out and install the branch via this command:
pip install git+https://github.com/jfc4050/DeepSpeed.git@s3-pr
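To try the suggested ZeRO stage 3 + BF16 combination with the HF Trainer, you would point `--deepspeed` at a config file along these lines. This is a minimal sketch: the `bf16` and `zero_optimization` keys follow DeepSpeed's config schema, but the filename and the optional CPU offload setting are illustrative choices, not values taken from this issue.

```python
import json

# Minimal DeepSpeed config enabling ZeRO stage 3 with BF16.
# "auto" lets the HF Trainer fill in values from its own arguments.
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        # optional: trade step speed for GPU memory on an 11B model
        "offload_optimizer": {"device": "cpu"},
    },
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

# Hypothetical filename; pass it to the trainer via --deepspeed.
with open("ds_config_zero3_bf16.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```

Note that the traceback above goes through `stage2.py`, i.e. ZeRO stage 2; moving to stage 3 changes which checkpointing code path runs, which is part of why this was suggested.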
Also receiving OOM errors. I wonder what I can do?