
Error while saving T5-11B checkpoint

See original GitHub issue

I'm getting this error, which I honestly don't understand:

[INFO|trainer.py:1995] 2021-11-19 01:06:24,979 >> Saving model checkpoint to /local/nlp/temp/poetryT5-11B_new/checkpoint-21
[INFO|configuration_utils.py:417] 2021-11-19 01:06:24,980 >> Configuration saved in /local/nlp/temp/poetryT5-11B_new/checkpoint-21/config.json
[INFO|modeling_utils.py:1058] 2021-11-19 01:07:05,343 >> Model weights saved in /local/nlp/temp/poetryT5-11B_new/checkpoint-21/pytorch_model.bin
[INFO|tokenization_utils_base.py:2034] 2021-11-19 01:07:05,345 >> tokenizer config file saved in /local/nlp/temp/poetryT5-11B_new/checkpoint-21/tokenizer_config.json
[INFO|tokenization_utils_base.py:2040] 2021-11-19 01:07:05,345 >> Special tokens file saved in /local/nlp/temp/poetryT5-11B_new/checkpoint-21/special_tokens_map.json
[INFO|tokenization_t5_fast.py:159] 2021-11-19 01:07:05,380 >> Copy vocab file to /local/nlp/temp/poetryT5-11B_new/checkpoint-21/spiece.model
[2021-11-19 01:07:05,399] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: /local/nlp/temp/poetryT5-11B_new/checkpoint-21/global_step21/mp_rank_00_model_states.pt
Traceback (most recent call last):
  File "./finetune_trainer.py", line 368, in <module>
    main()
  File "./finetune_trainer.py", line 305, in main
    train_result = trainer.train(
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/transformers/trainer.py", line 1391, in train
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/transformers/trainer.py", line 1495, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/transformers/trainer.py", line 1561, in _save_checkpoint
    self.deepspeed.save_checkpoint(output_dir)
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 2304, in save_checkpoint
    self._save_zero_checkpoint(save_dir, tag)
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 2556, in _save_zero_checkpoint
    zero_sd = dict(optimizer_state_dict=self.optimizer.state_dict(),
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/deepspeed/runtime/zero/stage2.py", line 1962, in state_dict
    state_dict['base_optimizer_state'] = self._get_base_optimizer_state()
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/deepspeed/runtime/zero/stage2.py", line 1940, in _get_base_optimizer_state
    lean_optimizer_state = self._get_state_without_padding(
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/deepspeed/runtime/zero/stage2.py", line 1928, in _get_state_without_padding
    lean_state[key] = value[:lean_length]
IndexError: slice() cannot be applied to a 0-dim tensor.

(The same traceback is raised on each of the three ranks.)
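For context on that IndexError: when ZeRO stage 2 saves a checkpoint it trims each per-parameter optimizer-state tensor back to its un-padded length with value[:lean_length], and that slice fails for any 0-dim (scalar) entry in the optimizer state. Below is a minimal sketch of the failure mode only; the example state keys, and the guess that a scalar entry such as a step counter is the culprit, are my assumptions rather than something confirmed in the issue.

# Minimal reproduction of the slicing error; this is not DeepSpeed's actual code.
import torch

state = {
    "exp_avg": torch.zeros(10),     # 1-dim: slicing works
    "exp_avg_sq": torch.zeros(10),  # 1-dim: slicing works
    "step": torch.tensor(3.0),      # 0-dim scalar: slicing fails (hypothetical culprit)
}
lean_length = 8

lean_state = {}
for key, value in state.items():
    try:
        # Mirrors the failing line in _get_state_without_padding.
        lean_state[key] = value[:lean_length]
    except IndexError as err:
        # Prints: IndexError: slice() cannot be applied to a 0-dim tensor.
        print(f"{key}: {err}")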

Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 9 (4 by maintainers)

Top GitHub Comments

1 reaction
jeffra commented, Dec 1, 2021

Hi @tuhinjubcse, I see you’ve been working with the excellent @stas00 on some of these issues. I finished reading up on the latest with you two in this issue https://github.com/huggingface/transformers/issues/14531.

As Stas mentioned, once this DeepSpeed PR https://github.com/microsoft/DeepSpeed/pull/1453 is merged you should be able to run ZeRO stage 3 with BF16 support, which should help reduce memory and potentially improve throughput. If you want to give it a try before it's merged, you can check out and install the branch via this command: pip install git+https://github.com/jfc4050/DeepSpeed.git@s3-pr
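For reference, a DeepSpeed config for ZeRO stage 3 with BF16 might look roughly like the sketch below once that PR is in; the exact keys (in particular the bf16 block) and the optional offload settings are assumptions based on DeepSpeed's config format, so verify them against the docs of the release you actually install.

# Hedged sketch of a ds_config.json for ZeRO stage 3 + bf16, to be passed to
# the HF Trainer via --deepspeed ds_config.json.
import json

ds_config = {
    "bf16": {"enabled": True},  # replaces the usual fp16 block
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},  # optional, trades speed for memory
        "offload_param": {"device": "cpu"},      # optional, trades speed for memory
    },
    "train_micro_batch_size_per_gpu": "auto",  # let the HF integration fill these in
    "gradient_accumulation_steps": "auto",
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)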

0 reactions
tuhinjubcse commented, Nov 23, 2021
  warnings.warn(formatted_warning, FutureWarning)
{'loss': 6.0737, 'learning_rate': 0.0, 'epoch': 0.02}                                                                                                                                                       
{'loss': 0.1926, 'learning_rate': 0.0, 'epoch': 0.04}                                                                                                                                                       
{'loss': 0.0399, 'learning_rate': 0.0, 'epoch': 0.06}                                                                                                                                                       
  8% | 1999/24128 [1:52:11<20:35:01, 3.35s/it]
[2021-11-22 19:51:55,198] [INFO] [logging.py:69:log_dist] [Rank 0] step=2000, skipped=1999, lr=[0.0, 0.0], mom=[0.0, 0.0]
[2021-11-22 19:51:55,199] [INFO] [timer.py:181:stop] 0/2000, SamplesPerSec=9.546767962244255
{'loss': 0.0749, 'learning_rate': 0.0, 'epoch': 0.08}                                                                                                                                                       
{'loss': 0.408, 'learning_rate': 0.0, 'epoch': 0.1}                                                                                                                                                         
{'loss': 0.0354, 'learning_rate': 0.0, 'epoch': 0.12}                                                                                                                                                       
{'loss': 0.0341, 'learning_rate': 0.0, 'epoch': 0.15}                                                                                                                                                       
 17% | 3999/24128 [3:43:57<18:47:06, 3.36s/it]
[2021-11-22 21:43:41,103] [INFO] [logging.py:69:log_dist] [Rank 0] step=4000, skipped=3999, lr=[0.0, 0.0], mom=[0.0, 0.0]
[2021-11-22 21:43:41,103] [INFO] [timer.py:181:stop] 0/4000, SamplesPerSec=9.564911481857864
{'loss': 0.0316, 'learning_rate': 0.0, 'epoch': 0.17}                                                                                                                                                       
{'loss': 0.0802, 'learning_rate': 0.0, 'epoch': 0.19}                                                                                                                                                       
{'loss': 0.035, 'learning_rate': 0.0, 'epoch': 0.21}                                                                                                                                                        
{'loss': 0.1423, 'learning_rate': 0.0, 'epoch': 0.23}                                                                                                                                                       
 25% | 5999/24128 [5:35:43<16:52:01, 3.35s/it]
[2021-11-22 23:35:26,678] [INFO] [logging.py:69:log_dist] [Rank 0] step=6000, skipped=5999, lr=[0.0, 0.0], mom=[0.0, 0.0]
[2021-11-22 23:35:26,678] [INFO] [timer.py:181:stop] 0/6000, SamplesPerSec=9.571203445125207
{'loss': 0.1107, 'learning_rate': 0.0, 'epoch': 0.25}                                                                                                                                                       
{'loss': 0.0467, 'learning_rate': 0.0, 'epoch': 0.27}                                                                                                                                                       
{'loss': 0.0802, 'learning_rate': 0.0, 'epoch': 0.29}                                                                                                                                                       
{'loss': 0.0706, 'learning_rate': 0.0, 'epoch': 0.31}                                                                                                                                                       
 33% | 7999/24128 [7:27:26<15:00:20, 3.35s/it]
[2021-11-23 01:27:10,465] [INFO] [logging.py:69:log_dist] [Rank 0] step=8000, skipped=7999, lr=[0.0, 0.0], mom=[0.0, 0.0]
[2021-11-23 01:27:10,465] [INFO] [timer.py:181:stop] 0/8000, SamplesPerSec=9.574953735862689
{'loss': 0.22, 'learning_rate': 0.0, 'epoch': 0.33}                                                                                                                                                         
{'loss': 0.0967, 'learning_rate': 0.0, 'epoch': 0.35}                                                                                                                                                       
{'loss': 0.0716, 'learning_rate': 0.0, 'epoch': 0.37}                                                                                                                                                       
{'loss': 0.1111, 'learning_rate': 0.0, 'epoch': 0.39}                                                                                                                                                       
 41% | 9999/24128 [9:19:10<13:10:15, 3.36s/it]
[2021-11-23 03:18:53,863] [INFO] [logging.py:69:log_dist] [Rank 0] step=10000, skipped=9999, lr=[0.0, 0.0], mom=[0.0, 0.0]
[2021-11-23 03:18:53,863] [INFO] [timer.py:181:stop] 0/10000, SamplesPerSec=9.577305314814142
{'loss': 0.2233, 'learning_rate': 0.0, 'epoch': 0.41}                                                                                                                                                       
 43% | 10397/24128 [9:41:24<12:47:24, 3.35s/it]
Traceback (most recent call last):
  File "./finetune_trainer.py", line 368, in <module>
    main()
  File "./finetune_trainer.py", line 305, in main
    train_result = trainer.train(
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/transformers/trainer.py", line 1316, in train
    tr_loss_step = self.training_step(model, inputs)
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/transformers/trainer.py", line 1865, in training_step
    loss = self.deepspeed.backward(loss)
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1708, in backward
    self.optimizer.backward(loss)
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/deepspeed/runtime/zero/stage2.py", line 1880, in backward
    buf_1 = torch.empty(int(self.reduce_bucket_size),
RuntimeError: CUDA out of memory. Tried to allocate 382.00 MiB (GPU 1; 39.59 GiB total capacity; 36.01 GiB already allocated; 164.94 MiB free; 36.22 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I'm also getting an OOM error; I wonder what I can do?
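On the OOM: the allocator message itself suggests trying max_split_size_mb when reserved memory is much larger than allocated memory. A minimal way to try that from the training script is sketched below; the 128 MiB value is an arbitrary example, it must take effect before the first CUDA allocation, and it only mitigates fragmentation, so it will not help if the model genuinely does not fit on the GPU.

# Set the caching-allocator option suggested in the error message. Put this at
# the very top of the training script (or export it in the launcher's
# environment) so it is in place before torch touches the GPU.
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch  # imported only after the environment variable is set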

Read more comments on GitHub >

Top Results From Across the Web

Tensorflow saving error - python - Stack Overflow
Nvm, turns out I just needed to specify the checkpoint file format when saving. Changing save_model to self._Save_model.save(sess ...
Read more >
Error occurs when saving model in multi-gpu settings
OSError: Unable to load weights from pytorch checkpoint file for 'xxx' at 'my_model_dir' If you tried to load a PyTorch model from a...
Read more >
"Failed to save object" error message when trying to change ...
Following sk94871 does not resolve the issue. Solution. This problem was fixed. The fix is included in: SmartConsole R80.10. Check Point ...
Read more >
How to Fix the Error: Hyper-V Checkpoint Operation Failed
Hyper-V checkpoint is a feature that allows you to save a virtual machine's state by creating a differencing virtual disk. Any changes made...
Read more >
