
RoBERTa: Truncation error: Sequence to truncate too short to respect the provided max_length


Environment info

  • transformers version: 4.9.0
  • Platform: Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.7.11
  • PyTorch version (GPU?): 1.9.0+cu102 (False)
  • Tensorflow version (GPU?): 2.5.0 (False)
  • Flax version (CPU?/GPU?/TPU?): TPU
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: The error occurs with both GPU and TPU
  • Using distributed or parallel set-up in script?: No

Who can help

Models: RoBERTa - @LysandreJik, @patrickvonplaten, @patil-suraj

Information

I am using a RoBERTa model for SQuAD 2.0 and get the error below when tokenizing question/context pairs.

The problem arises when using:

  • an official GLUE/SQuAD task: SQuAD 2.0

To reproduce

Steps to reproduce the behavior: I am tokenizing the SQuAD 2.0 dataset with the roberta-base tokenizer and model. The snippet below worked until a few days ago and now raises the error shown, without any code changes on my side.

    from datasets import load_dataset
    from transformers import AutoTokenizer, AutoModelForQuestionAnswering

    token_checkpoint = "roberta-base"
    model_checkpoint = "roberta-base"

    # ModelArguments is a user-defined container for the settings below
    model_args = ModelArguments(
        model_checkpoint=model_checkpoint,
        token_checkpoint=token_checkpoint,
        squad_v2=True,
        max_length=384,
        doc_stride=128,
        batch_size=8,
        n_best_size=25,
        max_answer_length=30,
        min_null_score=7.0,  # for RoBERTa
        NA_threshold=-3,
        pad_side="right",
    )

    tokenizer = AutoTokenizer.from_pretrained(token_checkpoint)
    model = AutoModelForQuestionAnswering.from_pretrained(
        model_checkpoint,
        attention_probs_dropout_prob=0.2,
        hidden_dropout_prob=0.2,
    )

    datasets = load_dataset("squad_v2" if model_args.squad_v2 else "squad")

    def prepare_train_features(examples):
        # Padding on the right means the question comes first and only the
        # context (the second sequence) may be truncated.
        pad_on_right = model_args.pad_side == "right"
        return tokenizer(
            examples["question" if pad_on_right else "context"],
            examples["context" if pad_on_right else "question"],
            truncation="only_second" if pad_on_right else "only_first",
            max_length=model_args.max_length,
            stride=model_args.doc_stride,
            return_overflowing_tokens=True,
            return_offsets_mapping=True,
            padding="max_length",
        )

    tokenized_datasets = datasets.map(
        prepare_train_features,
        batched=True,
        batch_size=1000,
        remove_columns=datasets["train"].column_names,
    )


ERROR message:

    Truncation error: Sequence to truncate too short to respect the provided max_length
    Traceback (most recent call last):
      File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
        _start_fn(index, pf_cfg, fn, args)
      File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
        fn(gindex, *args)
      File "<ipython-input-14-3842fd6863c2>", line 75, in pipeline
        tokenized_datasets = datasets.map(prepare_train_features, batched=True, batch_size=1000, remove_columns=datasets["train"].column_names)
      File "/usr/local/lib/python3.7/dist-packages/datasets/dataset_dict.py", line 489, in map
        for k, dataset in self.items()
      File "/usr/local/lib/python3.7/dist-packages/datasets/dataset_dict.py", line 489, in <dictcomp>
        for k, dataset in self.items()
      File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 1679, in map
        desc=desc,
      File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 185, in wrapper
        out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
      File "/usr/local/lib/python3.7/dist-packages/datasets/fingerprint.py", line 397, in wrapper
        out = func(self, *args, **kwargs)
      File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 2014, in _map_single
        offset=offset,
      File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_dataset.py", line 1900, in apply_function_on_filtered_inputs
        function(*fn_args, effective_indices, **fn_kwargs) if with_indices else function(*fn_args, **fn_kwargs)
      File "<ipython-input-6-54e98dcfc55e>", line 14, in prepare_train_features
        padding="max_length",
      File "/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py", line 2385, in __call__
        **kwargs,
      File "/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py", line 2570, in batch_encode_plus
        **kwargs,
      File "/usr/local/lib/python3.7/dist-packages/transformers/models/gpt2/tokenization_gpt2_fast.py", line 163, in _batch_encode_plus
        return super()._batch_encode_plus(*args, **kwargs)
      File "/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_fast.py", line 408, in _batch_encode_plus
        is_pretokenized=is_split_into_words,
    Exception: Truncation error: Sequence to truncate too short to respect the provided max_length
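For context: the tokenizers backend raises this error when the sequence it is allowed to truncate (here the context, because of truncation="only_second") is shorter than the number of tokens that would need to be removed to fit max_length. In other words, an example whose question plus special tokens alone exceeds max_length leaves the context nothing to give up. A minimal sketch that should reproduce the error in isolation (the artificial long question is an assumption, standing in for whichever dataset example actually triggers it):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("roberta-base")

    # The first sequence alone exceeds max_length; with truncation="only_second"
    # the tokenizer may only shorten the context, which is impossible here,
    # so it raises the truncation error.
    very_long_question = "why " * 400  # roughly 400 tokens, over max_length=384
    short_context = "A short context paragraph."

    tokenizer(
        very_long_question,
        short_context,
        truncation="only_second",
        max_length=384,
        stride=128,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )
    # Exception: Truncation error: Sequence to truncate too short to
    # respect the provided max_length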

Expected behavior

The SQuAD 2.0 dataset should be tokenized without errors.
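As a sanity check once tokenization works: with return_overflowing_tokens=True, each long example is split into several max_length windows, and overflow_to_sample_mapping ties each window back to its source example. A quick check, reusing the prepare_train_features defined above:

    features = prepare_train_features(datasets["train"][:8])
    print(len(features["input_ids"]))              # >= 8 once long contexts overflow
    print(features["overflow_to_sample_mapping"])  # e.g. [0, 0, 1, 2, 2, ...]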

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 7 (2 by maintainers)

Top GitHub Comments

sgugger commented, Jul 28, 2021 (2 reactions)

I have fixed the example notebook and the PR mentioned above shows how to fix it in the example scripts.
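For readers still on older copies of the example code, a workaround in the same spirit (this helper is hypothetical, not quoted from the PR) is to make sure the question can never exhaust the max_length budget on its own before the main tokenizer call:

    MAX_QUESTION_TOKENS = 128  # hypothetical cap, well under max_length=384

    def clip_question(question):
        # Tokenize without special tokens, keep at most the first
        # MAX_QUESTION_TOKENS tokens, and decode back to text.
        ids = tokenizer(question, add_special_tokens=False)["input_ids"]
        if len(ids) <= MAX_QUESTION_TOKENS:
            return question
        return tokenizer.decode(ids[:MAX_QUESTION_TOKENS])

    # Inside prepare_train_features, before calling the tokenizer on pairs:
    # examples["question"] = [clip_question(q.lstrip()) for q in examples["question"]]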

PremalMatalia commented, Aug 1, 2021 (1 reaction)

Thanks for fixing this issue.


