GPT2 IndexError: index out of range in functional.py when running run_clm.py with any special tokens added (even eos and bos only)
Hi all, I need your help as I'm stuck on an IndexError while trying to fine-tune GPT2 using run_clm.py with special tokens added. The error is triggered at this line of functional.py:
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
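For context, a minimal sketch (not from the issue) of the same failure mode: GPT2's token embedding only has vocab_size rows, so any input id at or beyond that bound, such as an id created by newly added special tokens, raises exactly this error.

import torch
import torch.nn as nn

# GPT2's token embedding has vocab_size (50257) rows of size n_embd (768).
wte = nn.Embedding(50257, 768)

# Ids introduced by added special tokens (50257 and above) are out of range
# until the embedding matrix is resized, so the lookup fails.
wte(torch.tensor([[50257]]))  # IndexError: index out of range in self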
run_clm.py has barely been modified: I just add the tokens with tokenizer.add_special_tokens. See below for the details of the modification, the args used, and the error log.
After weeks of preparing datasets, we hope to use your amazing scripts and library for an awesome AI project. I need your help, please!
Environment info
- transformers version: 4.5.0
- Platform: Darwin-20.2.0-x86_64-i386-64bit
- Python version: 3.7.9
- PyTorch version (GPU?): 1.8.1 (False)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: NO
- Using distributed or parallel set-up in script?: NO
I also tried on Windows with CUDA 11.1, the same transformers version, the same Python version, etc.: same issue.
Who can help
@patrickvonplaten, @LysandreJik, @sgugger
Information
Model I am using (Bert, XLNet …): GPT2 Medium
The problem arises when using:
- the official example scripts: (give details below)
The task I am working on is:
- my own task or dataset: (give details below)
To reproduce
Steps to reproduce the behavior:
- Run transformers/examples/language-modeling/run_clm.py with the args below. You can probably reproduce the exact same issue with any dataset; it doesn't look like a dataset-related issue, as training works without the special tokens added.
- The file run_clm.py has been modified slightly, just to include the eos token, the bos token and additional special tokens (see below). The issue persists as long as I add any of these special tokens. The only workaround seems to be to add no special tokens at all with this GPT2 fine-tuning code, which is unfortunate because I need them for my purpose.
ARGS
python transformers/examples/language-modeling/run_clm.py \
--output_dir "models/output/" \
--model_type "gpt2" \
--model_name_or_path "models/original/" \
--tokenizer_name "gpt2" \
--cache_dir "models/cache/" \
--no_use_fast_tokenizer \
--do_train True \
--train_file "models/datasets/dataset-training-05042021.txt" \
--do_eval True \
--validation_file "models/datasets/dataset-validation-05042021.txt" \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 1 \
--save_steps 500 \
--num_train_epochs 5 \
--learning_rate 5e-5 \
--weight_decay 0 \
--adam_beta1 0.9 \
--adam_beta2 0.999 \
--adam_epsilon 1e-8 \
--max_grad_norm 1.0 \
--no_cuda True \
--seed 123456 \
--fp16 False \
--fp16_opt_level "O1" \
--fp16_backend "auto" \
--fp16_full_eval False \
CODE MODIFICATION
I added this code at line 308 of run_clm.py, just before the call to model.resize_token_embeddings(len(tokenizer)):
special_tokens_dict = {
'bos_token': '<|startoftext|>',
'eos_token': '<|endoftext|>',
'additional_special_tokens': [
"<A>",
"<B>",
"<C>",
"<D>",
"<E>",
"<F>",
"<G>",
"<H>"
]
}
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
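For reference, a minimal sketch (not the exact run_clm.py source) of the pattern this modification relies on: the special tokens are added to the tokenizer, then the model's embedding matrix is resized so it covers the new ids, which all land at or above the original vocab_size of 50257.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-ins for the tokenizer and model that run_clm.py has already built at this
# point; special_tokens_dict is the dictionary shown above.
tokenizer = AutoTokenizer.from_pretrained("gpt2", use_fast=False)
model = AutoModelForCausalLM.from_pretrained("gpt2")

num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)  # 9 new tokens (eos already exists)
model.resize_token_embeddings(len(tokenizer))  # embedding grows from 50257 to 50266 rows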
ISSUE LOGS
04/06/2021 17:48:36 - WARNING - __main__ - Process rank: -1, device: cpu, n_gpu: 0distributed training: False, 16-bits training: False
04/06/2021 17:48:36 - INFO - __main__ - Training/evaluation parameters TrainingArguments(output_dir=models/output/, overwrite_output_dir=False, do_train=True, do_eval=True, do_predict=False, evaluation_strategy=IntervalStrategy.NO, prediction_loss_only=False, per_device_train_batch_size=1, per_device_eval_batch_size=1, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=5.0, max_steps=-1, lr_scheduler_type=SchedulerType.LINEAR, warmup_ratio=0.0, warmup_steps=0, logging_dir=runs/Apr06_17-48-36_BLABLABLA-MacBook-Air.local, logging_strategy=IntervalStrategy.STEPS, logging_first_step=False, logging_steps=500, save_strategy=IntervalStrategy.STEPS, save_steps=500, save_total_limit=None, no_cuda=True, seed=261184, fp16=False, fp16_opt_level=O1, fp16_backend=auto, fp16_full_eval=False, local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=500, dataloader_num_workers=0, past_index=-1, run_name=models/output/, disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, sharded_ddp=[], deepspeed=None, label_smoothing_factor=0.0, adafactor=False, group_by_length=False, length_column_name=length, report_to=['tensorboard'], ddp_find_unused_parameters=None, dataloader_pin_memory=True, skip_memory_metrics=False, _n_gpu=0, mp_parameters=)
04/06/2021 17:48:36 - WARNING - datasets.builder - Using custom data configuration default-544362d6d13a5db7
04/06/2021 17:48:36 - WARNING - datasets.builder - Reusing dataset text (/Users/blablabla/.cache/huggingface/datasets/text/default-544362d6d13a5db7/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5)
[INFO|configuration_utils.py:488] 2021-04-06 17:48:36,800 >> loading configuration file models/original/config.json
[INFO|configuration_utils.py:526] 2021-04-06 17:48:36,802 >> Model config GPT2Config {
"activation_function": "gelu_new",
"architectures": [
"GPT2LMHeadModel"
],
"attn_pdrop": 0.1,
"bos_token_id": 50256,
"embd_pdrop": 0.1,
"eos_token_id": 50256,
"gradient_checkpointing": false,
"initializer_range": 0.02,
"layer_norm_epsilon": 1e-05,
"model_type": "gpt2",
"n_ctx": 1024,
"n_embd": 768,
"n_head": 12,
"n_inner": null,
"n_layer": 12,
"n_positions": 1024,
"resid_pdrop": 0.1,
"summary_activation": null,
"summary_first_dropout": 0.1,
"summary_proj_to_labels": true,
"summary_type": "cls_index",
"summary_use_proj": true,
"task_specific_params": {
"text-generation": {
"do_sample": true,
"max_length": 50
}
},
"transformers_version": "4.5.0",
"use_cache": true,
"vocab_size": 50257
}
[INFO|configuration_utils.py:490] 2021-04-06 17:48:37,245 >> loading configuration file https://huggingface.co/gpt2/resolve/main/config.json from cache at models/cache/fc674cd6907b4c9e933cb42d67662436b89fa9540a1f40d7c919d0109289ad01.7d2e0efa5ca20cef4fb199382111e9d3ad96fd77b849e1d4bed13a66e1336f51
[INFO|configuration_utils.py:526] 2021-04-06 17:48:37,247 >> Model config GPT2Config {
"activation_function": "gelu_new",
"architectures": [
"GPT2LMHeadModel"
],
"attn_pdrop": 0.1,
"bos_token_id": 50256,
"embd_pdrop": 0.1,
"eos_token_id": 50256,
"gradient_checkpointing": false,
"initializer_range": 0.02,
"layer_norm_epsilon": 1e-05,
"model_type": "gpt2",
"n_ctx": 1024,
"n_embd": 768,
"n_head": 12,
"n_inner": null,
"n_layer": 12,
"n_positions": 1024,
"resid_pdrop": 0.1,
"summary_activation": null,
"summary_first_dropout": 0.1,
"summary_proj_to_labels": true,
"summary_type": "cls_index",
"summary_use_proj": true,
"task_specific_params": {
"text-generation": {
"do_sample": true,
"max_length": 50
}
},
"transformers_version": "4.5.0",
"use_cache": true,
"vocab_size": 50257
}
[INFO|tokenization_utils_base.py:1707] 2021-04-06 17:48:39,085 >> loading file https://huggingface.co/gpt2/resolve/main/vocab.json from cache at models/cache/684fe667923972fb57f6b4dcb61a3c92763ad89882f3da5da9866baf14f2d60f.c7ed1f96aac49e745788faa77ba0a26a392643a50bb388b9c04ff469e555241f
[INFO|tokenization_utils_base.py:1707] 2021-04-06 17:48:39,085 >> loading file https://huggingface.co/gpt2/resolve/main/merges.txt from cache at models/cache/c0c761a63004025aeadd530c4c27b860ec4ecbe8a00531233de21d865a402598.5d12962c5ee615a4c803841266e9c3be9a691a924f72d395d3a6c6c81157788b
[INFO|tokenization_utils_base.py:1707] 2021-04-06 17:48:39,086 >> loading file https://huggingface.co/gpt2/resolve/main/added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:1707] 2021-04-06 17:48:39,086 >> loading file https://huggingface.co/gpt2/resolve/main/special_tokens_map.json from cache at None
[INFO|tokenization_utils_base.py:1707] 2021-04-06 17:48:39,086 >> loading file https://huggingface.co/gpt2/resolve/main/tokenizer_config.json from cache at None
[INFO|tokenization_utils_base.py:1707] 2021-04-06 17:48:39,086 >> loading file https://huggingface.co/gpt2/resolve/main/tokenizer.json from cache at models/cache/16a2f78023c8dc511294f0c97b5e10fde3ef9889ad6d11ffaa2a00714e73926e.cf2d0ecb83b6df91b3dbb53f1d1e4c311578bfd3aa0e04934215a49bf9898df0
[INFO|modeling_utils.py:1050] 2021-04-06 17:48:39,223 >> loading weights file models/original/pytorch_model.bin
[INFO|modeling_utils.py:1168] 2021-04-06 17:48:45,948 >> All model checkpoint weights were used when initializing GPT2LMHeadModel.
[INFO|modeling_utils.py:1177] 2021-04-06 17:48:45,949 >> All the weights of GPT2LMHeadModel were initialized from the model checkpoint at models/original/.
If your task is similar to the task the model of the checkpoint was trained on, you can already use GPT2LMHeadModel for predictions without further training.
[INFO|tokenization_utils_base.py:873] 2021-04-06 17:48:45,949 >> Assigning <|startoftext|> to the bos_token key of the tokenizer
[INFO|tokenization_utils.py:207] 2021-04-06 17:48:45,950 >> Adding <|startoftext|> to the vocabulary
[INFO|tokenization_utils_base.py:873] 2021-04-06 17:48:45,950 >> Assigning <|endoftext|> to the eos_token key of the tokenizer
[INFO|tokenization_utils_base.py:873] 2021-04-06 17:48:45,950 >> Assigning ['<A>', '<B>', '<C>', '<D>', '<E>', '<F>', '<G>', '<H>'] to the additional_special_tokens key of the tokenizer
[INFO|tokenization_utils.py:207] 2021-04-06 17:48:45,950 >> Adding <A> to the vocabulary
[INFO|tokenization_utils.py:207] 2021-04-06 17:48:45,950 >> Adding <B> to the vocabulary
[INFO|tokenization_utils.py:207] 2021-04-06 17:48:45,950 >> Adding <C> to the vocabulary
[INFO|tokenization_utils.py:207] 2021-04-06 17:48:45,950 >> Adding <D> to the vocabulary
[INFO|tokenization_utils.py:207] 2021-04-06 17:48:45,950 >> Adding <E> to the vocabulary
[INFO|tokenization_utils.py:207] 2021-04-06 17:48:45,950 >> Adding <F> to the vocabulary
[INFO|tokenization_utils.py:207] 2021-04-06 17:48:45,950 >> Adding <G> to the vocabulary
[INFO|tokenization_utils.py:207] 2021-04-06 17:48:45,950 >> Adding <H> to the vocabulary
100%|██████████| 199/199 [01:15<00:00,  2.62ba/s]
100%|██████████| 10/10 [00:03<00:00,  2.69ba/s]
100%|██████████| 199/199 [01:02<00:00,  3.17ba/s]
100%|██████████| 10/10 [00:02<00:00,  3.39ba/s]
[INFO|trainer.py:921] 2021-04-06 17:51:21,859 >> Loading model from models/original/).
[INFO|configuration_utils.py:488] 2021-04-06 17:51:21,924 >> loading configuration file models/original/config.json
[INFO|configuration_utils.py:526] 2021-04-06 17:51:21,931 >> Model config GPT2Config {
"activation_function": "gelu_new",
"architectures": [
"GPT2LMHeadModel"
],
"attn_pdrop": 0.1,
"bos_token_id": 50256,
"embd_pdrop": 0.1,
"eos_token_id": 50256,
"gradient_checkpointing": false,
"initializer_range": 0.02,
"layer_norm_epsilon": 1e-05,
"model_type": "gpt2",
"n_ctx": 1024,
"n_embd": 768,
"n_head": 12,
"n_inner": null,
"n_layer": 12,
"n_positions": 1024,
"resid_pdrop": 0.1,
"summary_activation": null,
"summary_first_dropout": 0.1,
"summary_proj_to_labels": true,
"summary_type": "cls_index",
"summary_use_proj": true,
"task_specific_params": {
"text-generation": {
"do_sample": true,
"max_length": 50
}
},
"transformers_version": "4.5.0",
"use_cache": true,
"vocab_size": 50257
}
[INFO|modeling_utils.py:1050] 2021-04-06 17:51:21,950 >> loading weights file models/original/pytorch_model.bin
[INFO|modeling_utils.py:1168] 2021-04-06 17:51:31,409 >> All model checkpoint weights were used when initializing GPT2LMHeadModel.
[INFO|modeling_utils.py:1177] 2021-04-06 17:51:31,409 >> All the weights of GPT2LMHeadModel were initialized from the model checkpoint at models/original/.
If your task is similar to the task the model of the checkpoint was trained on, you can already use GPT2LMHeadModel for predictions without further training.
[INFO|trainer.py:1013] 2021-04-06 17:51:31,478 >> ***** Running training *****
[INFO|trainer.py:1014] 2021-04-06 17:51:31,483 >> Num examples = 8199
[INFO|trainer.py:1015] 2021-04-06 17:51:31,489 >> Num Epochs = 5
[INFO|trainer.py:1016] 2021-04-06 17:51:31,489 >> Instantaneous batch size per device = 1
[INFO|trainer.py:1017] 2021-04-06 17:51:31,489 >> Total train batch size (w. parallel, distributed & accumulation) = 1
[INFO|trainer.py:1018] 2021-04-06 17:51:31,489 >> Gradient Accumulation steps = 1
[INFO|trainer.py:1019] 2021-04-06 17:51:31,489 >> Total optimization steps = 40995
0%| | 0/40995 [00:00<?, ?it/s]Traceback (most recent call last):
File "transformers/examples/language-modeling/run_clm.py", line 459, in <module>
main()
File "transformers/examples/language-modeling/run_clm.py", line 424, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/Users/blablabla/Developer/Training/env/lib/python3.7/site-packages/transformers/trainer.py", line 1120, in train
tr_loss += self.training_step(model, inputs)
File "/Users/blablabla/Developer/Training/env/lib/python3.7/site-packages/transformers/trainer.py", line 1524, in training_step
loss = self.compute_loss(model, inputs)
File "/Users/blablabla/Developer/Training/env/lib/python3.7/site-packages/transformers/trainer.py", line 1556, in compute_loss
outputs = model(**inputs)
File "/Users/blablabla/Developer/Training/env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/Users/blablabla/Developer/Training/env/lib/python3.7/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 917, in forward
return_dict=return_dict,
File "/Users/blablabla/Developer/Training/env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/Users/blablabla/Developer/Training/env/lib/python3.7/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 694, in forward
inputs_embeds = self.wte(input_ids)
File "/Users/blablabla/Developer/Training/env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/Users/blablabla/Developer/Training/env/lib/python3.7/site-packages/torch/nn/modules/sparse.py", line 158, in forward
self.norm_type, self.scale_grad_by_freq, self.sparse)
File "/Users/blablabla/Developer/Training/env/lib/python3.7/site-packages/torch/nn/functional.py", line 1921, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self
0%| | 0/40995 [00:00<?, ?it/s]
Top GitHub Comments
Great, thank you so much @sgugger @LysandreJik! That makes sense now: I removed the line and it works perfectly.
I will let you know when we get closer to a launch date for our AI-based game. It's going to be awesome! Sorry to troll this thread, but does Hugging Face have a place to showcase apps made with your incredible libraries?
Ah, this is because your checkpoint should have the resized weights: it's resized inside the script, but since it's a local folder, it's also passed as a checkpoint to the Trainer later in the script, which then reloads the model from that folder without the model.resize_token_embeddings(len(tokenizer)) this time. So you have two solutions: either stop passing that folder as a checkpoint to the Trainer, or take your original model, apply model.resize_token_embeddings(len(tokenizer)) to it and then resave it.
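A minimal sketch of the second option, using the model folder from the args above: resize the saved model once and resave it, so the folder the Trainer reloads already contains the enlarged embedding matrix.

from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Same special tokens as in the code modification above.
special_tokens_dict = {
    'bos_token': '<|startoftext|>',
    'eos_token': '<|endoftext|>',
    'additional_special_tokens': ['<A>', '<B>', '<C>', '<D>',
                                  '<E>', '<F>', '<G>', '<H>'],
}

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.add_special_tokens(special_tokens_dict)

model = GPT2LMHeadModel.from_pretrained("models/original/")
model.resize_token_embeddings(len(tokenizer))

# Resave so the checkpoint the Trainer reloads already has the resized weights.
model.save_pretrained("models/original/")
tokenizer.save_pretrained("models/original/")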