
Parameter missing from state_dict of optimizer when loading from checkpoint


Environment info

  • transformers version: '4.2.0dev0'
  • Platform: Debian
  • Python version: Python 3.6.10 |Anaconda, Inc.| (default, May 8 2020, 02:54:21)
  • PyTorch version (GPU?): torch-xla-1.6
  • Tensorflow version (GPU?):
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: Yes
  • Using TPUs

Who can help

@sgugger

Information

Model I am using (Bert, XLNet …):

The problem arises when using:

  • [x] the official example scripts: (give details below)
  • [x] my own modified scripts: (give details below)

The task I am working on is:

  • MLM

To reproduce

The bug is triggered when loading a model from a checkpoint that was saved on a TPU.

Steps to reproduce the behavior:

  1. Run run_mlm.py on any dataset and store a checkpoint.
  2. Resume from that checkpoint with:

     python transformers/examples/language-modeling/run_mlm.py --warmup_steps 10000 --learning_rate 1e-4 --save_steps 100000 --max_seq_length 512 --logging_steps 50 --overwrite_output_dir --model_name_or_path ../../bucket/model_outputs/en/inverted_order_500K/mlm/checkpoint-10000 --do_train --do_eval --max_steps 500000 --per_device_train_batch_size 16 --per_device_eval_batch_size 16 --train_file ../../bucket/pretrain_data/en/valid.txt --validation_file ../../bucket/pretrain_data/en/valid.txt --output_dir ../../bucket/model_outputs/en/inverted_order_500K/mlm

  3. Or, to reproduce on TPU, spawn the same script across 8 cores:

     nohup python transformers/examples/xla_spawn.py --num_cores 8 transformers/examples/language-modeling/run_mlm.py --warmup_steps 10000 --learning_rate 1e-4 --save_steps 100000 --max_seq_length 512 --logging_steps 50 --overwrite_output_dir --model_name_or_path ../../bucket/model_outputs/en/inverted_order_500K/mlm/checkpoint-10000 --do_train --do_eval --max_steps 500000 --per_device_train_batch_size 16 --per_device_eval_batch_size 16 --train_file ../../bucket/pretrain_data/en/valid.txt --validation_file ../../bucket/pretrain_data/en/valid.txt --output_dir ../../bucket/model_outputs/en/inverted_order_500K/mlm

Error trace

This error trace uses a modified Trainer, but the issue occurs with the original Trainer as well.

Traceback (most recent call last):
  File "transformers/examples/xla_spawn.py", line 85, in <module>
    main()
  File "transformers/examples/xla_spawn.py", line 81, in main
    xmp.spawn(mod._mp_fn, args=(), nprocs=args.num_cores)
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 292, in spawn
    _start_fn(0, pf_cfg, fn, args)
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 229, in _start_fn
    fn(gindex, *args)
  File "/home/asd/source_code/Multilingual/transformers/examples/language-modeling/run_mlm_synthetic.py", line 486, in _mp_fn
    main()
  File "/home/asd/source_code/Multilingual/transformers/examples/language-modeling/run_mlm_synthetic.py", line 460, in main
    trainer.train(model_path=model_path)
  File "/home/asd/source_code/Multilingual/transformers/src/transformers/trainer_word_modifications.py", line 666, in train
    self._load_optimizer_and_scheduler(model_path)
  File "/home/asd/source_code/Multilingual/transformers/src/transformers/trainer_word_modifications.py", line 1003, in _load_optimizer_and_scheduler
    self.optimizer.load_state_dict(optimizer_state)
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/torch/optim/optimizer.py", line 123, in load_state_dict
    raise ValueError("loaded state dict contains a parameter group "
ValueError: loaded state dict contains a parameter group that doesn't match the size of optimizer's group
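For context, torch.optim.Optimizer.load_state_dict raises this particular ValueError whenever the 'params' list of any saved parameter group has a different length than the corresponding group of the freshly constructed optimizer. A minimal sketch that triggers the same error with toy models (not the script above):

    import torch
    from torch import nn, optim

    # Two optimizers whose single parameter group differs in size by one tensor.
    opt_saved = optim.Adam(nn.Linear(4, 4).parameters())              # weight + bias -> 2 params
    opt_fresh = optim.Adam(nn.Linear(4, 4, bias=False).parameters())  # weight only   -> 1 param

    # Raises: ValueError: loaded state dict contains a parameter group
    # that doesn't match the size of optimizer's group
    opt_fresh.load_state_dict(opt_saved.state_dict())

In other words, the check that fails compares group sizes between the saved state dict and the optimizer being restored.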

Where is the issue?

I’ve isolated the issue to a missing parameter in optimizer_state['state']: for some reason, index 136 is missing from optimizer_state['state'].keys(), even though it appears in the second parameter group below.

The following is the debugger output inside the function _load_optimizer_and_scheduler, just before the line self.optimizer.load_state_dict(optimizer_state) in the if is_torch_tpu_available() block.

>>> optimizer_state['param_groups']
[{'weight_decay': 0.0, 'lr': 0.0001, 'betas': [0.9, 0.999], 'eps': 1e-08, 'correct_bias': True, 'initial_lr': 0.0001, 'params': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53]}, {'weight_decay': 0.0, 'lr': 0.0001, 'betas': [0.9, 0.999], 'eps': 1e-08, 'correct_bias': True, 'initial_lr': 0.0001, 'params': [54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139]}]
>>> optimizer_state['state'].keys()
dict_keys([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 137, 138, 139])
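The same check can be run outside the debugger by loading the saved optimizer.pt directly and diffing the indices declared in param_groups against the keys actually present in state. A minimal sketch; the checkpoint path is taken from the repro command above, so substitute your own:

    import torch

    ckpt = "../../bucket/model_outputs/en/inverted_order_500K/mlm/checkpoint-10000/optimizer.pt"
    optimizer_state = torch.load(ckpt, map_location="cpu")

    declared = {p for group in optimizer_state["param_groups"] for p in group["params"]}
    present = set(optimizer_state["state"].keys())

    # Given the debugger output above, this should report {136}.
    print("declared but missing from state:", declared - present)

Note that Adam populates optimizer_state['state'] lazily, only once a parameter has actually been updated, so a parameter that never received a gradient before the checkpoint was written would legitimately have no state entry.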

Expected behavior

The optimizer state should load without error so that training resumes from the checkpoint.


Top GitHub Comments

ameet-1997 commented on Aug 11, 2021 (1 reaction)

Looks like you already found the solution, thanks for that! I wasn’t able to fix it earlier.

finiteautomata commented on Aug 5, 2021 (0 reactions)

I’m facing the same problem with 4.10.0.dev0. @ameet-1997 could you find a solution for this?


