Parameter missing from state_dict of optimizer when loading from checkpoint
Environment info
- `transformers` version: 4.2.0dev0
- Platform: Debian
- Python version: 3.6.10 |Anaconda, Inc.| (default, May 8 2020, 02:54:21)
- PyTorch version (GPU?): torch-xla-1.6
- Tensorflow version (GPU?):
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: Yes
- Using TPUs
Who can help
Information
Model I am using (Bert, XLNet …):
The problem arises when using:
- [x] the official example scripts (details below)
- [x] my own modified scripts (details below)
The task I am working on is:
- MLM
To reproduce
You need to load a model from a checkpoint saved on the TPU.
Steps to reproduce the behavior:
- Run `run_mlm.py` on any dataset and save a checkpoint. Then resume from that checkpoint using the following command:
python transformers/examples/language-modeling/run_mlm.py --warmup_steps 10000 --learning_rate 1e-4 --save_steps 100000 --max_seq_length 512 --logging_steps 50 --overwrite_output_dir --model_name_or_path ../../bucket/model_outputs/en/inverted_order_500K/mlm/checkpoint-10000 --do_train --do_eval --max_steps 500000 --per_device_train_batch_size 16 --per_device_eval_batch_size 16 --train_file ../../bucket/pretrain_data/en/valid.txt --validation_file ../../bucket/pretrain_data/en/valid.txt --output_dir ../../bucket/model_outputs/en/inverted_order_500K/mlm
- OR, launch through `xla_spawn.py` on 8 TPU cores (a stripped-down reproduction of the optimizer restore is sketched after this list):
nohup python transformers/examples/xla_spawn.py --num_cores 8 transformers/examples/language-modeling/run_mlm.py --warmup_steps 10000 --learning_rate 1e-4 --save_steps 100000 --max_seq_length 512 --logging_steps 50 --overwrite_output_dir --model_name_or_path ../../bucket/model_outputs/en/inverted_order_500K/mlm/checkpoint-10000 --do_train --do_eval --max_steps 500000 --per_device_train_batch_size 16 --per_device_eval_batch_size 16 --train_file ../../bucket/pretrain_data/en/valid.txt --validation_file ../../bucket/pretrain_data/en/valid.txt --output_dir ../../bucket/model_outputs/en/inverted_order_500K/mlm
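For debugging in isolation, the optimizer restore can also be exercised without the full training run by rebuilding the optimizer and loading the saved `optimizer.pt` directly. This is only a sketch: the checkpoint path is the one from the commands above, and the two-group parameter split mirrors what `Trainer` does but is an assumption here, not the exact code it runs.

```python
import torch
from transformers import AdamW, AutoModelForMaskedLM

# Checkpoint directory from the commands above (adjust to your own path).
ckpt = "../../bucket/model_outputs/en/inverted_order_500K/mlm/checkpoint-10000"

model = AutoModelForMaskedLM.from_pretrained(ckpt)

# Two parameter groups (decay / no-decay), mirroring how Trainer builds its
# optimizer -- this grouping is an assumption for the sketch.
no_decay = ["bias", "LayerNorm.weight"]
param_groups = [
    {"params": [p for n, p in model.named_parameters()
                if not any(nd in n for nd in no_decay)], "weight_decay": 0.0},
    {"params": [p for n, p in model.named_parameters()
                if any(nd in n for nd in no_decay)], "weight_decay": 0.0},
]
optimizer = AdamW(param_groups, lr=1e-4)

# Trainer saves the optimizer state as optimizer.pt inside the checkpoint dir.
optimizer_state = torch.load(f"{ckpt}/optimizer.pt", map_location="cpu")
optimizer.load_state_dict(optimizer_state)  # in the broken setup this raises the ValueError shown below
```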
Error trace
This error trace uses a modified `Trainer`, but the issue occurs with the original `Trainer` as well.
Traceback (most recent call last):
  File "transformers/examples/xla_spawn.py", line 85, in <module>
    main()
  File "transformers/examples/xla_spawn.py", line 81, in main
    xmp.spawn(mod._mp_fn, args=(), nprocs=args.num_cores)
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 292, in spawn
    _start_fn(0, pf_cfg, fn, args)
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 229, in _start_fn
    fn(gindex, *args)
  File "/home/asd/source_code/Multilingual/transformers/examples/language-modeling/run_mlm_synthetic.py", line 486, in _mp_fn
    main()
  File "/home/asd/source_code/Multilingual/transformers/examples/language-modeling/run_mlm_synthetic.py", line 460, in main
    trainer.train(model_path=model_path)
  File "/home/asd/source_code/Multilingual/transformers/src/transformers/trainer_word_modifications.py", line 666, in train
    self._load_optimizer_and_scheduler(model_path)
  File "/home/asd/source_code/Multilingual/transformers/src/transformers/trainer_word_modifications.py", line 1003, in _load_optimizer_and_scheduler
    self.optimizer.load_state_dict(optimizer_state)
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/torch/optim/optimizer.py", line 123, in load_state_dict
    raise ValueError("loaded state dict contains a parameter group "
ValueError: loaded state dict contains a parameter group that doesn't match the size of optimizer's group
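For context, the exception comes from the consistency check at the top of `torch.optim.Optimizer.load_state_dict`, which compares the saved parameter groups against the groups of the freshly built optimizer before any per-parameter state is restored. A rough paraphrase of that check (for illustration only, not the exact PyTorch source):

```python
def check_param_groups(optimizer, state_dict):
    """Mimics the group-size check in torch.optim.Optimizer.load_state_dict
    (paraphrased for illustration, not the exact source)."""
    groups = optimizer.param_groups
    saved_groups = state_dict["param_groups"]

    if len(groups) != len(saved_groups):
        raise ValueError("loaded state dict has a different number of parameter groups")
    if any(len(g["params"]) != len(s["params"]) for g, s in zip(groups, saved_groups)):
        raise ValueError(
            "loaded state dict contains a parameter group "
            "that doesn't match the size of optimizer's group"
        )
```

So the error fires when the number of parameters in one of the saved groups differs from the number in the corresponding group of the optimizer being restored.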
Where is the issue?
I’ve isolated the issue to a missing parameter in `optimizer_state['state']`: for some reason, index 136 is missing from `optimizer_state['state'].keys()`.
The following is the debugger output inside `_load_optimizer_and_scheduler`, just before the line `self.optimizer.load_state_dict(optimizer_state)` in the `if is_torch_tpu_available()` block.
>>> optimizer_state['param_groups']
[{'weight_decay': 0.0, 'lr': 0.0001, 'betas': [0.9, 0.999], 'eps': 1e-08, 'correct_bias': True, 'initial_lr': 0.0001, 'params': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53]}, {'weight_decay': 0.0, 'lr': 0.0001, 'betas': [0.9, 0.999], 'eps': 1e-08, 'correct_bias': True, 'initial_lr': 0.0001, 'params': [54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139]}]
>>> optimizer_state['state'].keys()
dict_keys([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 137, 138, 139])
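A quick way to confirm which index is affected is to diff the parameter ids listed in `param_groups` against the keys of `state`. This is just a debugging sketch run against the `optimizer_state` dict shown above:

```python
# Find parameter indices listed in param_groups that have no entry in state.
all_params = {p for group in optimizer_state["param_groups"] for p in group["params"]}
missing = sorted(all_params - set(optimizer_state["state"].keys()))
print(missing)  # -> [136] for the dump above
```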
Expected behavior
The checkpoint should load correctly and training should resume from it.
Top GitHub Comments
Looks like you already found the solution, thanks for that! I wasn’t able to fix it earlier.
I’m facing the same problem with 4.10.0.dev0. @ameet-1997, could you find a solution for this?