RoBERTa: Truncation error: Sequence to truncate too short to respect the provided max_length
Environment info
- transformers version: 4.9.0
- Platform: Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.7.11
- PyTorch version (GPU?): 1.9.0+cu102 (False)
- Tensorflow version (GPU?): 2.5.0 (False)
- Flax version (CPU?/GPU?/TPU?): TPU
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: The error occurs with both GPU and TPU
- Using distributed or parallel set-up in script?: No
Who can help
Models: RoBERTa - @LysandreJik, @patrickvonplaten, @patil-suraj
Library:
- tokenizers: @LysandreJik
Information
Model I am using: RoBERTa for SQuAD 2.0. I get the error below when tokenizing question/context pairs.
The task I am working on is:
- an official GLUE/SQuAD task: SQuAD 2.0
To reproduce
Steps to reproduce the behavior: I am trying to tokenize the SQuAD 2.0 dataset with the roberta-base tokenizer and model, but it now fails with the error below. The same code snippet was working until a few days ago, and nothing in it has changed.
from datasets import load_dataset
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

token_checkpoint = "roberta-base"
model_checkpoint = "roberta-base"

# ModelArguments is a plain container (defined elsewhere in the notebook) holding the settings below.
model_args = ModelArguments(
    model_checkpoint=model_checkpoint,
    token_checkpoint=token_checkpoint,
    squad_v2=True,
    max_length=384,
    doc_stride=128,
    batch_size=8,
    n_best_size=25,
    max_answer_length=30,
    min_null_score=7.0,  # for RoBERTa
    NA_threshold=-3,
    pad_side="right",
)

tokenizer = AutoTokenizer.from_pretrained(token_checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(
    model_checkpoint,
    attention_probs_dropout_prob=0.2,
    hidden_dropout_prob=0.2,
)

datasets = load_dataset("squad_v2" if model_args.squad_v2 else "squad")

def prepare_train_features(examples):
    # pad_side="right" (a truthy string) selects question-first ordering
    # and truncation of the second sequence (the context) only.
    tokenized_examples = tokenizer(
        examples["question" if model_args.pad_side else "context"],
        examples["context" if model_args.pad_side else "question"],
        truncation="only_second" if model_args.pad_side else "only_first",
        max_length=model_args.max_length,
        stride=model_args.doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )
    return tokenized_examples

tokenized_datasets = datasets.map(
    prepare_train_features,
    batched=True,
    batch_size=1000,
    remove_columns=datasets["train"].column_names,
)
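For reference, the combination of truncation="only_second", stride and return_overflowing_tokens=True is what splits a long context into several overlapping features. Below is a minimal, standalone sketch of that behaviour on a toy question/context pair (not part of the original report; the toy strings are made up):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

question = "What gets truncated here?"
context = "word " * 600  # deliberately much longer than max_length

enc = tokenizer(
    question,
    context,
    truncation="only_second",        # only the context may be truncated
    max_length=64,
    stride=16,                       # overlap between consecutive windows
    return_overflowing_tokens=True,  # one feature per window
    return_offsets_mapping=True,
)
# One question/context pair becomes several overlapping features.
print(len(enc["input_ids"]))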
ERROR message: Truncation error: Sequence to truncate too short to respect the provided max_length
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    fn(gindex, *args)
  File "<ipython-input-14-3842fd6863c2>", line 75, in pipeline
    tokenized_datasets = datasets.map(prepare_train_features, batched=True, batch_size=1000, remove_columns=datasets["train"].column_names)
  File "/usr/local/lib/python3.7/dist-packages/datasets/dataset_dict.py", line 489, in map
    for k, dataset in self.items()
  File "/usr/local/lib/python3.7/dist-packages/datasets/dataset_dict.py", line 489, in <dictcomp>
    for k, dataset in self.items()
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 1679, in map
    desc=desc,
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 185, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/datasets/fingerprint.py", line 397, in wrapper
    out = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 2014, in _map_single
    offset=offset,
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 1900, in apply_function_on_filtered_inputs
    function(*fn_args, effective_indices, **fn_kwargs) if with_indices else function(*fn_args, **fn_kwargs)
  File "<ipython-input-6-54e98dcfc55e>", line 14, in prepare_train_features
    padding="max_length",
  File "/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py", line 2385, in __call__
    **kwargs,
  File "/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py", line 2570, in batch_encode_plus
    **kwargs,
  File "/usr/local/lib/python3.7/dist-packages/transformers/models/gpt2/tokenization_gpt2_fast.py", line 163, in _batch_encode_plus
    return super()._batch_encode_plus(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_fast.py", line 408, in _batch_encode_plus
    is_pretokenized=is_split_into_words,
Exception: Truncation error: Sequence to truncate too short to respect the provided max_length
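Read literally, the error raised by the underlying tokenizers library means that the sequence it was asked to truncate (the context, since truncation="only_second") is shorter than the number of tokens that would have to be removed to respect max_length; one way this can happen is a question that alone, plus special tokens, nearly fills max_length. Below is a hypothetical diagnostic along those lines (an assumption about the trigger, not taken from the issue) that lists such question/context pairs:

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
squad = load_dataset("squad_v2")

max_length = 384
# Special tokens added around a sentence pair (for RoBERTa: <s> ... </s></s> ... </s>).
num_special = tokenizer.num_special_tokens_to_add(pair=True)

for idx, question in enumerate(squad["train"]["question"]):
    question_len = len(tokenizer(question, add_special_tokens=False)["input_ids"])
    # If the question plus special tokens leaves (almost) no room for context tokens,
    # "only_second" truncation cannot shorten the context enough to satisfy max_length.
    if question_len + num_special >= max_length:
        print(idx, question_len, question[:80])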
Expected behavior
The SQuAD 2.0 dataset should be tokenized without any error.
Top GitHub Comments
- I have fixed the example notebook, and the PR mentioned above shows how to fix it in the example scripts.
- Thanks for fixing this issue.