GPT2 IndexError: index out of range in functional.py when running run_clm.py with any special tokens added (even eos and bos only)
Hi all, I need your help as I'm stuck on an IndexError while trying to fine-tune GPT2 using run_clm.py with special tokens added. The error is triggered at this line of functional.py:
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
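For context, a minimal sketch (not from the issue) of the same failure mode: GPT2's token embedding only has vocab_size rows, so any input id at or beyond that bound, such as an id created by newly added special tokens, raises exactly this error.

import torch
import torch.nn as nn

# GPT2's token embedding has vocab_size (50257) rows of size n_embd (768).
wte = nn.Embedding(50257, 768)

# Ids introduced by added special tokens (50257 and above) are out of range
# until the embedding matrix is resized, so the lookup fails.
wte(torch.tensor([[50257]]))  # IndexError: index out of range in self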
run_clm.py has barely been modified: I just add the tokens with tokenizer.add_special_tokens. See below for the details of the modification, the args used, and the error log.
After weeks of preparing datasets, we hope to use your amazing scripts and library for an awesome AI project. I need your help, please!
Environment info
- transformers version: 4.5.0
- Platform: Darwin-20.2.0-x86_64-i386-64bit
- Python version: 3.7.9
- PyTorch version (GPU?): 1.8.1 (False)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: NO
- Using distributed or parallel set-up in script?: NO
I also tried on Windows with CUDA 11.1, the same transformers version, the same Python version, etc.: same issue.
Who can help
@patrickvonplaten, @LysandreJik, @sgugger
Information
Model I am using (Bert, XLNet …): GPT2 Medium
The problem arises when using:
- the official example scripts: (give details below)
The task I am working on is:
- my own task or dataset: (give details below)
To reproduce
Steps to reproduce the behavior:
- Run transformers/examples/language-modeling/run_clm.py with the args below. You can probably reproduce the exact same issue with any dataset; it doesn't look like a dataset-related issue, as training works without the special tokens added.
- The file run_clm.py has been modified slightly, just to include the eos token, the bos token and additional special tokens (see below). The issue persists as long as I add any of these special tokens. The only workaround seems to be to add no special tokens at all with this GPT2 fine-tuning code, which is unfortunate because I need them for my purpose.
ARGS
python transformers/examples/language-modeling/run_clm.py \
--output_dir "models/output/" \
--model_type "gpt2" \
--model_name_or_path "models/original/" \
--tokenizer_name "gpt2" \
--cache_dir "models/cache/" \
--no_use_fast_tokenizer \
--do_train True \
--train_file "models/datasets/dataset-training-05042021.txt" \
--do_eval True \
--validation_file "models/datasets/dataset-validation-05042021.txt" \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 1 \
--save_steps 500 \
--num_train_epochs 5 \
--learning_rate 5e-5 \
--weight_decay 0 \
--adam_beta1 0.9 \
--adam_beta2 0.999 \
--adam_epsilon 1e-8 \
--max_grad_norm 1.0 \
--no_cuda True \
--seed 123456 \
--fp16 False \
--fp16_opt_level "O1" \
--fp16_backend "auto" \
--fp16_full_eval False \
CODE MODIFICATION
I added this code at line 308 of run_clm.py, just before the call to model.resize_token_embeddings(len(tokenizer)):
special_tokens_dict = {
'bos_token': '<|startoftext|>',
'eos_token': '<|endoftext|>',
'additional_special_tokens': [
"<A>",
"<B>",
"<C>",
"<D>",
"<E>",
"<F>",
"<G>",
"<H>"
]
}
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
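For reference, a minimal sketch (not the exact run_clm.py source) of the pattern this modification relies on: the special tokens are added to the tokenizer, then the model's embedding matrix is resized so it covers the new ids, which all land at or above the original vocab_size of 50257.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-ins for the tokenizer and model that run_clm.py has already built at this
# point; special_tokens_dict is the dictionary shown above.
tokenizer = AutoTokenizer.from_pretrained("gpt2", use_fast=False)
model = AutoModelForCausalLM.from_pretrained("gpt2")

num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)  # 9 new tokens (eos already exists)
model.resize_token_embeddings(len(tokenizer))  # embedding grows from 50257 to 50266 rows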
ISSUE LOGS
04/06/2021 17:48:36 - WARNING - __main__ - Process rank: -1, device: cpu, n_gpu: 0distributed training: False, 16-bits training: False
04/06/2021 17:48:36 - INFO - __main__ - Training/evaluation parameters TrainingArguments(output_dir=models/output/, overwrite_output_dir=False, do_train=True, do_eval=True, do_predict=False, evaluation_strategy=IntervalStrategy.NO, prediction_loss_only=False, per_device_train_batch_size=1, per_device_eval_batch_size=1, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=5.0, max_steps=-1, lr_scheduler_type=SchedulerType.LINEAR, warmup_ratio=0.0, warmup_steps=0, logging_dir=runs/Apr06_17-48-36_BLABLABLA-MacBook-Air.local, logging_strategy=IntervalStrategy.STEPS, logging_first_step=False, logging_steps=500, save_strategy=IntervalStrategy.STEPS, save_steps=500, save_total_limit=None, no_cuda=True, seed=261184, fp16=False, fp16_opt_level=O1, fp16_backend=auto, fp16_full_eval=False, local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=500, dataloader_num_workers=0, past_index=-1, run_name=models/output/, disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, sharded_ddp=[], deepspeed=None, label_smoothing_factor=0.0, adafactor=False, group_by_length=False, length_column_name=length, report_to=['tensorboard'], ddp_find_unused_parameters=None, dataloader_pin_memory=True, skip_memory_metrics=False, _n_gpu=0, mp_parameters=)
04/06/2021 17:48:36 - WARNING - datasets.builder - Using custom data configuration default-544362d6d13a5db7
04/06/2021 17:48:36 - WARNING - datasets.builder - Reusing dataset text (/Users/blablabla/.cache/huggingface/datasets/text/default-544362d6d13a5db7/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5)
[INFO|configuration_utils.py:488] 2021-04-06 17:48:36,800 >> loading configuration file models/original/config.json
[INFO|configuration_utils.py:526] 2021-04-06 17:48:36,802 >> Model config GPT2Config {
"activation_function": "gelu_new",
"architectures": [
"GPT2LMHeadModel"
],
"attn_pdrop": 0.1,
"bos_token_id": 50256,
"embd_pdrop": 0.1,
"eos_token_id": 50256,
"gradient_checkpointing": false,
"initializer_range": 0.02,
"layer_norm_epsilon": 1e-05,
"model_type": "gpt2",
"n_ctx": 1024,
"n_embd": 768,
"n_head": 12,
"n_inner": null,
"n_layer": 12,
"n_positions": 1024,
"resid_pdrop": 0.1,
"summary_activation": null,
"summary_first_dropout": 0.1,
"summary_proj_to_labels": true,
"summary_type": "cls_index",
"summary_use_proj": true,
"task_specific_params": {
"text-generation": {
"do_sample": true,
"max_length": 50
}
},
"transformers_version": "4.5.0",
"use_cache": true,
"vocab_size": 50257
}
[INFO|configuration_utils.py:490] 2021-04-06 17:48:37,245 >> loading configuration file https://huggingface.co/gpt2/resolve/main/config.json from cache at models/cache/fc674cd6907b4c9e933cb42d67662436b89fa9540a1f40d7c919d0109289ad01.7d2e0efa5ca20cef4fb199382111e9d3ad96fd77b849e1d4bed13a66e1336f51
[INFO|configuration_utils.py:526] 2021-04-06 17:48:37,247 >> Model config GPT2Config {
"activation_function": "gelu_new",
"architectures": [
"GPT2LMHeadModel"
],
"attn_pdrop": 0.1,
"bos_token_id": 50256,
"embd_pdrop": 0.1,
"eos_token_id": 50256,
"gradient_checkpointing": false,
"initializer_range": 0.02,
"layer_norm_epsilon": 1e-05,
"model_type": "gpt2",
"n_ctx": 1024,
"n_embd": 768,
"n_head": 12,
"n_inner": null,
"n_layer": 12,
"n_positions": 1024,
"resid_pdrop": 0.1,
"summary_activation": null,
"summary_first_dropout": 0.1,
"summary_proj_to_labels": true,
"summary_type": "cls_index",
"summary_use_proj": true,
"task_specific_params": {
"text-generation": {
"do_sample": true,
"max_length": 50
}
},
"transformers_version": "4.5.0",
"use_cache": true,
"vocab_size": 50257
}
[INFO|tokenization_utils_base.py:1707] 2021-04-06 17:48:39,085 >> loading file https://huggingface.co/gpt2/resolve/main/vocab.json from cache at models/cache/684fe667923972fb57f6b4dcb61a3c92763ad89882f3da5da9866baf14f2d60f.c7ed1f96aac49e745788faa77ba0a26a392643a50bb388b9c04ff469e555241f
[INFO|tokenization_utils_base.py:1707] 2021-04-06 17:48:39,085 >> loading file https://huggingface.co/gpt2/resolve/main/merges.txt from cache at models/cache/c0c761a63004025aeadd530c4c27b860ec4ecbe8a00531233de21d865a402598.5d12962c5ee615a4c803841266e9c3be9a691a924f72d395d3a6c6c81157788b
[INFO|tokenization_utils_base.py:1707] 2021-04-06 17:48:39,086 >> loading file https://huggingface.co/gpt2/resolve/main/added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:1707] 2021-04-06 17:48:39,086 >> loading file https://huggingface.co/gpt2/resolve/main/special_tokens_map.json from cache at None
[INFO|tokenization_utils_base.py:1707] 2021-04-06 17:48:39,086 >> loading file https://huggingface.co/gpt2/resolve/main/tokenizer_config.json from cache at None
[INFO|tokenization_utils_base.py:1707] 2021-04-06 17:48:39,086 >> loading file https://huggingface.co/gpt2/resolve/main/tokenizer.json from cache at models/cache/16a2f78023c8dc511294f0c97b5e10fde3ef9889ad6d11ffaa2a00714e73926e.cf2d0ecb83b6df91b3dbb53f1d1e4c311578bfd3aa0e04934215a49bf9898df0
[INFO|modeling_utils.py:1050] 2021-04-06 17:48:39,223 >> loading weights file models/original/pytorch_model.bin
[INFO|modeling_utils.py:1168] 2021-04-06 17:48:45,948 >> All model checkpoint weights were used when initializing GPT2LMHeadModel.
[INFO|modeling_utils.py:1177] 2021-04-06 17:48:45,949 >> All the weights of GPT2LMHeadModel were initialized from the model checkpoint at models/original/.
If your task is similar to the task the model of the checkpoint was trained on, you can already use GPT2LMHeadModel for predictions without further training.
[INFO|tokenization_utils_base.py:873] 2021-04-06 17:48:45,949 >> Assigning <|startoftext|> to the bos_token key of the tokenizer
[INFO|tokenization_utils.py:207] 2021-04-06 17:48:45,950 >> Adding <|startoftext|> to the vocabulary
[INFO|tokenization_utils_base.py:873] 2021-04-06 17:48:45,950 >> Assigning <|endoftext|> to the eos_token key of the tokenizer
[INFO|tokenization_utils_base.py:873] 2021-04-06 17:48:45,950 >> Assigning ['<A>', '<B>', '<C>', '<D>', '<E>', '<F>', '<G>', '<H>'] to the additional_special_tokens key of the tokenizer
[INFO|tokenization_utils.py:207] 2021-04-06 17:48:45,950 >> Adding <A> to the vocabulary
[INFO|tokenization_utils.py:207] 2021-04-06 17:48:45,950 >> Adding <B> to the vocabulary
[INFO|tokenization_utils.py:207] 2021-04-06 17:48:45,950 >> Adding <C> to the vocabulary
[INFO|tokenization_utils.py:207] 2021-04-06 17:48:45,950 >> Adding <D> to the vocabulary
[INFO|tokenization_utils.py:207] 2021-04-06 17:48:45,950 >> Adding <E> to the vocabulary
[INFO|tokenization_utils.py:207] 2021-04-06 17:48:45,950 >> Adding <F> to the vocabulary
[INFO|tokenization_utils.py:207] 2021-04-06 17:48:45,950 >> Adding <G> to the vocabulary
[INFO|tokenization_utils.py:207] 2021-04-06 17:48:45,950 >> Adding <H> to the vocabulary
100%|██████████| 199/199 [01:15<00:00,  2.62ba/s]
100%|██████████| 10/10 [00:03<00:00,  2.69ba/s]
100%|██████████| 199/199 [01:02<00:00,  3.17ba/s]
100%|██████████| 10/10 [00:02<00:00,  3.39ba/s]
[INFO|trainer.py:921] 2021-04-06 17:51:21,859 >> Loading model from models/original/).
[INFO|configuration_utils.py:488] 2021-04-06 17:51:21,924 >> loading configuration file models/original/config.json
[INFO|configuration_utils.py:526] 2021-04-06 17:51:21,931 >> Model config GPT2Config {
"activation_function": "gelu_new",
"architectures": [
"GPT2LMHeadModel"
],
"attn_pdrop": 0.1,
"bos_token_id": 50256,
"embd_pdrop": 0.1,
"eos_token_id": 50256,
"gradient_checkpointing": false,
"initializer_range": 0.02,
"layer_norm_epsilon": 1e-05,
"model_type": "gpt2",
"n_ctx": 1024,
"n_embd": 768,
"n_head": 12,
"n_inner": null,
"n_layer": 12,
"n_positions": 1024,
"resid_pdrop": 0.1,
"summary_activation": null,
"summary_first_dropout": 0.1,
"summary_proj_to_labels": true,
"summary_type": "cls_index",
"summary_use_proj": true,
"task_specific_params": {
"text-generation": {
"do_sample": true,
"max_length": 50
}
},
"transformers_version": "4.5.0",
"use_cache": true,
"vocab_size": 50257
}
[INFO|modeling_utils.py:1050] 2021-04-06 17:51:21,950 >> loading weights file models/original/pytorch_model.bin
[INFO|modeling_utils.py:1168] 2021-04-06 17:51:31,409 >> All model checkpoint weights were used when initializing GPT2LMHeadModel.
[INFO|modeling_utils.py:1177] 2021-04-06 17:51:31,409 >> All the weights of GPT2LMHeadModel were initialized from the model checkpoint at models/original/.
If your task is similar to the task the model of the checkpoint was trained on, you can already use GPT2LMHeadModel for predictions without further training.
[INFO|trainer.py:1013] 2021-04-06 17:51:31,478 >> ***** Running training *****
[INFO|trainer.py:1014] 2021-04-06 17:51:31,483 >> Num examples = 8199
[INFO|trainer.py:1015] 2021-04-06 17:51:31,489 >> Num Epochs = 5
[INFO|trainer.py:1016] 2021-04-06 17:51:31,489 >> Instantaneous batch size per device = 1
[INFO|trainer.py:1017] 2021-04-06 17:51:31,489 >> Total train batch size (w. parallel, distributed & accumulation) = 1
[INFO|trainer.py:1018] 2021-04-06 17:51:31,489 >> Gradient Accumulation steps = 1
[INFO|trainer.py:1019] 2021-04-06 17:51:31,489 >> Total optimization steps = 40995
0%| | 0/40995 [00:00<?, ?it/s]Traceback (most recent call last):
File "transformers/examples/language-modeling/run_clm.py", line 459, in <module>
main()
File "transformers/examples/language-modeling/run_clm.py", line 424, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/Users/blablabla/Developer/Training/env/lib/python3.7/site-packages/transformers/trainer.py", line 1120, in train
tr_loss += self.training_step(model, inputs)
File "/Users/blablabla/Developer/Training/env/lib/python3.7/site-packages/transformers/trainer.py", line 1524, in training_step
loss = self.compute_loss(model, inputs)
File "/Users/blablabla/Developer/Training/env/lib/python3.7/site-packages/transformers/trainer.py", line 1556, in compute_loss
outputs = model(**inputs)
File "/Users/blablabla/Developer/Training/env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/Users/blablabla/Developer/Training/env/lib/python3.7/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 917, in forward
return_dict=return_dict,
File "/Users/blablabla/Developer/Training/env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/Users/blablabla/Developer/Training/env/lib/python3.7/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 694, in forward
inputs_embeds = self.wte(input_ids)
File "/Users/blablabla/Developer/Training/env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/Users/blablabla/Developer/Training/env/lib/python3.7/site-packages/torch/nn/modules/sparse.py", line 158, in forward
self.norm_type, self.scale_grad_by_freq, self.sparse)
File "/Users/blablabla/Developer/Training/env/lib/python3.7/site-packages/torch/nn/functional.py", line 1921, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self
0%| | 0/40995 [00:00<?, ?it/s]
Top GitHub Comments
Great, thank you so much @sgugger @LysandreJik! That makes sense now: I removed the line and it works perfectly.
I will let you know when we get closer to a launch date for our AI-based game. It's going to be awesome! Sorry to troll this thread, but does Hugging Face have a place to showcase apps made with your incredible libraries?
Ah, this is because your checkpoint should have the resized weights: it's resized inside the script, but since it's a local folder, it's also passed as a checkpoint to the Trainer later in the script, which then reloads the model from that folder without the model.resize_token_embeddings(len(tokenizer)) this time. So you have two solutions: either stop passing that folder as a checkpoint to the Trainer, or take your original model, apply model.resize_token_embeddings(len(tokenizer)) to it and then resave it.
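A minimal sketch of the second option, using the model folder from the args above: resize the saved model once and resave it, so the folder the Trainer reloads already contains the enlarged embedding matrix.

from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Same special tokens as in the code modification above.
special_tokens_dict = {
    'bos_token': '<|startoftext|>',
    'eos_token': '<|endoftext|>',
    'additional_special_tokens': ['<A>', '<B>', '<C>', '<D>',
                                  '<E>', '<F>', '<G>', '<H>'],
}

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.add_special_tokens(special_tokens_dict)

model = GPT2LMHeadModel.from_pretrained("models/original/")
model.resize_token_embeddings(len(tokenizer))

# Resave so the checkpoint the Trainer reloads already has the resized weights.
model.save_pretrained("models/original/")
tokenizer.save_pretrained("models/original/")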