RoBERTa: Truncation error: Sequence to truncate too short to respect the provided max_length
Environment info
- transformers version: 4.9.0
- Platform: Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.7.11
- PyTorch version (GPU?): 1.9.0+cu102 (False)
- Tensorflow version (GPU?): 2.5.0 (False)
- Flax version (CPU?/GPU?/TPU?): TPU
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: The error occurs with both GPU and TPU
- Using distributed or parallel set-up in script?: No
Who can help
Models: RoBERTa - @LysandreJik, @patrickvonplaten, @patil-suraj
Library:
- tokenizers: @LysandreJik
Information
Model I am using: RoBERTa for SQuAD 2.0. I get the error below when tokenizing question/context pairs.
The task I am working on is:
- an official GLUE/SQuAD task: SQuAD 2.0
To reproduce
Steps to reproduce the behavior: I am trying to tokenize the SQuAD 2.0 dataset with the roberta-base tokenizer and model, but it now fails with the error below. The same code snippet was working until a few days ago, and nothing in it has changed.
from datasets import load_dataset
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

token_checkpoint = "roberta-base"
model_checkpoint = "roberta-base"

# ModelArguments is a plain container (defined elsewhere in the notebook) holding the settings below.
model_args = ModelArguments(
    model_checkpoint=model_checkpoint,
    token_checkpoint=token_checkpoint,
    squad_v2=True,
    max_length=384,
    doc_stride=128,
    batch_size=8,
    n_best_size=25,
    max_answer_length=30,
    min_null_score=7.0,  # for RoBERTa
    NA_threshold=-3,
    pad_side="right",
)

tokenizer = AutoTokenizer.from_pretrained(token_checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(
    model_checkpoint,
    attention_probs_dropout_prob=0.2,
    hidden_dropout_prob=0.2,
)

datasets = load_dataset("squad_v2" if model_args.squad_v2 else "squad")

def prepare_train_features(examples):
    # pad_side="right" (a truthy string) selects question-first ordering
    # and truncation of the second sequence (the context) only.
    tokenized_examples = tokenizer(
        examples["question" if model_args.pad_side else "context"],
        examples["context" if model_args.pad_side else "question"],
        truncation="only_second" if model_args.pad_side else "only_first",
        max_length=model_args.max_length,
        stride=model_args.doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )
    return tokenized_examples

tokenized_datasets = datasets.map(
    prepare_train_features,
    batched=True,
    batch_size=1000,
    remove_columns=datasets["train"].column_names,
)
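For reference, the combination of truncation="only_second", stride and return_overflowing_tokens=True is what splits a long context into several overlapping features. Below is a minimal, standalone sketch of that behaviour on a toy question/context pair (not part of the original report; the toy strings are made up):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

question = "What gets truncated here?"
context = "word " * 600  # deliberately much longer than max_length

enc = tokenizer(
    question,
    context,
    truncation="only_second",        # only the context may be truncated
    max_length=64,
    stride=16,                       # overlap between consecutive windows
    return_overflowing_tokens=True,  # one feature per window
    return_offsets_mapping=True,
)
# One question/context pair becomes several overlapping features.
print(len(enc["input_ids"]))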
ERROR message: Truncation error: Sequence to truncate too short to respect the provided max_length
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    fn(gindex, *args)
  File "<ipython-input-14-3842fd6863c2>", line 75, in pipeline
    tokenized_datasets = datasets.map(prepare_train_features, batched=True, batch_size=1000, remove_columns=datasets["train"].column_names)
  File "/usr/local/lib/python3.7/dist-packages/datasets/dataset_dict.py", line 489, in map
    for k, dataset in self.items()
  File "/usr/local/lib/python3.7/dist-packages/datasets/dataset_dict.py", line 489, in <dictcomp>
    for k, dataset in self.items()
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 1679, in map
    desc=desc,
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 185, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/datasets/fingerprint.py", line 397, in wrapper
    out = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 2014, in _map_single
    offset=offset,
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 1900, in apply_function_on_filtered_inputs
    function(*fn_args, effective_indices, **fn_kwargs) if with_indices else function(*fn_args, **fn_kwargs)
  File "<ipython-input-6-54e98dcfc55e>", line 14, in prepare_train_features
    padding="max_length",
  File "/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py", line 2385, in __call__
    **kwargs,
  File "/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py", line 2570, in batch_encode_plus
    **kwargs,
  File "/usr/local/lib/python3.7/dist-packages/transformers/models/gpt2/tokenization_gpt2_fast.py", line 163, in _batch_encode_plus
    return super()._batch_encode_plus(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_fast.py", line 408, in _batch_encode_plus
    is_pretokenized=is_split_into_words,
Exception: Truncation error: Sequence to truncate too short to respect the provided max_length
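Read literally, the error raised by the underlying tokenizers library means that the sequence it was asked to truncate (the context, since truncation="only_second") is shorter than the number of tokens that would have to be removed to respect max_length; one way this can happen is a question that alone, plus special tokens, nearly fills max_length. Below is a hypothetical diagnostic along those lines (an assumption about the trigger, not taken from the issue) that lists such question/context pairs:

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
squad = load_dataset("squad_v2")

max_length = 384
# Special tokens added around a sentence pair (for RoBERTa: <s> ... </s></s> ... </s>).
num_special = tokenizer.num_special_tokens_to_add(pair=True)

for idx, question in enumerate(squad["train"]["question"]):
    question_len = len(tokenizer(question, add_special_tokens=False)["input_ids"])
    # If the question plus special tokens leaves (almost) no room for context tokens,
    # "only_second" truncation cannot shorten the context enough to satisfy max_length.
    if question_len + num_special >= max_length:
        print(idx, question_len, question[:80])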
Expected behavior
The SQuAD 2.0 dataset should be tokenized without any error.
Top GitHub Comments
- I have fixed the example notebook, and the PR mentioned above shows how to fix it in the example scripts.
- Thanks for fixing this issue.