KeyError in GLUE data tokenization with RoBERTa
🐛 Bug
I'm getting a KeyError here when using RoBERTa in examples/run_glue.py and trying to access 'token_type_ids' while preprocessing the data, maybe from this commit removing 'token_type_ids' from RoBERTa (and DistilBERT)?
I get the error when fine-tuning RoBERTa on CoLA and RTE. I haven't tried other tasks, but I think you'd get the same error.
I don't get the error when fine-tuning XLNet (presumably because XLNet does use 'token_type_ids'), and I don't get the error when I do pip install transformers instead of pip install . (which suggests the issue comes from a recent commit).
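A quick way to see the difference (a minimal sketch, assuming a source install from current master vs. the 2.5.1 release) is to compare the keys the tokenizer returns for a sentence pair:

```python
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
inputs = tokenizer.encode_plus("premise text", "hypothesis text")

# On the 2.5.1 release this prints a dict containing 'token_type_ids';
# on a source install from master, the key is gone for RoBERTa.
print(inputs.keys())
```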
Here's the full error message:
03/17/2020 11:53:58 - INFO - transformers.data.processors.glue - Writing example 0/13997
Traceback (most recent call last):
  File "examples/run_glue.py", line 731, in <module>
    main()
  File "examples/run_glue.py", line 679, in main
    train_dataset = load_and_cache_examples(args, args.task_name, tokenizer, evaluate=False)
  File "examples/run_glue.py", line 419, in load_and_cache_examples
    pad_token_segment_id=4 if args.model_type in ["xlnet"] else 0,
  File "/home/ejp416/cmv/transformers/src/transformers/data/processors/glue.py", line 94, in glue_convert_examples_to_features
    input_ids, token_type_ids = inputs["input_ids"], inputs["token_type_ids"]
KeyError: 'token_type_ids'
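For anyone blocked on this, one possible local workaround (a hypothetical patch, not the official fix) is to make the unpacking in glue_convert_examples_to_features tolerate the missing key:

```python
# In src/transformers/data/processors/glue.py, around line 94 (hypothetical patch):
# fall back to an all-zero segment-id list when the tokenizer does not return
# 'token_type_ids' (e.g., RoBERTa, DistilBERT).
input_ids = inputs["input_ids"]
token_type_ids = inputs.get("token_type_ids", [0] * len(input_ids))
```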
Information
Model I am using (Bert, XLNet …): RoBERTa. I think DistilBERT may run into the same issue as well.
Language I am using the model on (English, Chinese …): English
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
I've made slight modifications to the training loop in the official examples/run_glue.py, but I did not touch the data pre-processing, which is where the error occurs (before any training).
The task I am working on is:
- an official GLUE/SQuAD task: (give the name)
- my own task or dataset: (give details below)
I've run into the error on CoLA and RTE, though I expect it would occur on all GLUE tasks.
To reproduce
Steps to reproduce the behavior:
- Install transformers using the latest clone (use pip install ., not pip install transformers).
- Download the RTE data (e.g., into data/RTE using the GLUE download scripts in this repo).
- Run a command to train RoBERTa (base or large). I'm using:
python examples/run_glue.py --model_type roberta --model_name_or_path roberta-base --output_dir models/debug --task_name rte --do_train --evaluate_during_training --data_dir data/RTE --max_seq_length 32 --max_grad_norm inf --adam_epsilon 1e-6 --adam_beta_2 0.98 --weight_decay 0.1 --logging_steps 874 --save_steps 874 --num_train_epochs 10 --warmup_steps 874 --per_gpu_train_batch_size 1 --per_gpu_eval_batch_size 2 --learning_rate 1e-5 --seed 12 --gradient_accumulation_steps 16 --overwrite_output_dir
Expected behavior
load_and_cache_examples (and specifically, the call to convert_examples_to_features) in examples/run_glue.py should run without error, loading, preprocessing, and tokenizing the dataset.
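A minimal way to exercise just that preprocessing path, without the full training script (a sketch, assuming the RTE data is in data/RTE as above):

```python
from transformers import RobertaTokenizer
from transformers.data.processors.glue import RteProcessor, glue_convert_examples_to_features

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
examples = RteProcessor().get_train_examples("data/RTE")

# On a source install from master, this call raises KeyError: 'token_type_ids'.
features = glue_convert_examples_to_features(examples, tokenizer, max_length=32, task="rte")
```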
Environment info
- transformers version: 2.5.1
- Platform: Linux-3.10.0-1062.12.1.el7.x86_64-x86_64-with-centos-7.7.1908-Core
- Python version: 3.7.6
- PyTorch version (GPU?): 1.4.0 (True)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: Error happens with both GPU and CPU
- Using distributed or parallel set-up in script?: No
Comments
I also have this issue when I run run_multiple_choice.py on the RACE data with RoBERTa.
I get the same error when I try to fine-tune on SQuAD.