Bart now enforces maximum sequence length in Summarization Pipeline


🐛 Bug

Information

Model I am using (Bert, XLNet …): Bart (bart-large-cnn)

Language I am using the model on (English, Chinese …): English

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

Based on example code in docs, though.

The task I am working on is:

  • an official GLUE/SQuAD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

  1. Load default summarization pipeline
  2. Try to use model to summarize text that has > 1024 tokens

Example code:

from transformers import pipeline
summarizer = pipeline('summarization')
text = '=' * 102570    # Happened to be the length of the file I was testing, my actual file produced 25,257 tokens
print(summarizer(text, max_length=250))

Output:

Token indices sequence length is longer than the specified maximum sequence length for this model (1605 > 1024). Running this sequence through the model will result in indexing errors
Traceback (most recent call last):
  File "ex.py", line 4, in <module>
    print(summarizer(text, max_length=250))
  File ".../lib/python3.7/site-packages/transformers/pipelines.py", line 1330, in __call__
    inputs["input_ids"], attention_mask=inputs["attention_mask"], **generate_kwargs,
  File ".../lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
    return func(*args, **kwargs)
  File ".../lib/python3.7/site-packages/transformers/modeling_utils.py", line 1047, in generate
    encoder_outputs: tuple = encoder(input_ids, attention_mask=attention_mask)
  File ".../lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File ".../lib/python3.7/site-packages/transformers/modeling_bart.py", line 292, in forward
    embed_pos = self.embed_positions(input_ids)
  File ".../lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File ".../lib/python3.7/site-packages/transformers/modeling_bart.py", line 763, in forward
    return super().forward(positions)
  File ".../lib/python3.7/site-packages/torch/nn/modules/sparse.py", line 114, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File ".../lib/python3.7/site-packages/torch/nn/functional.py", line 1724, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self
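
The IndexError at the bottom of the trace comes from the encoder's learned positional-embedding table: for bart-large-cnn it has max_position_embeddings = 1024 rows, so any token position past 1024 indexes out of range. A minimal pre-check sketch (not part of the original report; it assumes the current facebook/bart-large-cnn model ID and clips the input to the first 1024 tokens, the way the old pipeline did):

from transformers import BartTokenizer

# Hypothetical pre-check, assuming facebook/bart-large-cnn: count tokens and clip the
# input to the model's window before handing it to the summarization pipeline.
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
text = '=' * 102570
ids = tokenizer.encode(text)                       # token ids, special tokens included
if len(ids) > tokenizer.model_max_length:          # model_max_length is 1024 here
    ids = ids[:tokenizer.model_max_length]
    text = tokenizer.decode(ids, skip_special_tokens=True)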

Expected behavior

As of last week (the week of 4/26/2020) this caused no issue. Today (5/7/2020) I ran the exact same code; a new model was downloaded (no change to the transformers module, just the model itself), and it now enforces a token limit.

Expected behavior is to summarize the document regardless of its size.

Environment info

  • transformers version: 2.8.0 (also occurs in 2.9.0)
  • Platform: Both macOS 10.15.4 and Windows 10
  • Python version: 3.7.5 (Mac) and 3.6.3/Anaconda (Windows)
  • PyTorch version (GPU?): 1.5.0, no GPU
  • Tensorflow version (GPU?): n/a
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no
  • Model (from associated JSON file downloaded): {"url": "https://s3.amazonaws.com/models.huggingface.co/bert/facebook/bart-large-cnn/pytorch_model.bin", "etag": "\"6eeacfe81d9304a6c5015424912f8df8\""}
  • Model config:
{
  "_num_labels": 3,
  "activation_dropout": 0.0,
  "activation_function": "gelu",
  "add_final_layer_norm": false,
  "attention_dropout": 0.0,
  "bos_token_id": 0,
  "classif_dropout": 0.0,
  "d_model": 1024,
  "decoder_attention_heads": 16,
  "decoder_ffn_dim": 4096,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 12,
  "decoder_start_token_id": 2,
  "dropout": 0.1,
  "early_stopping": true,
  "encoder_attention_heads": 16,
  "encoder_ffn_dim": 4096,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 12,
  "eos_token_id": 2,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "init_std": 0.02,
  "is_encoder_decoder": true,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "length_penalty": 2.0,
  "max_length": 142,
  "max_position_embeddings": 1024,
  "min_length": 56,
  "model_type": "bart",
  "no_repeat_ngram_size": 3,
  "normalize_before": false,
  "num_beams": 4,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 1,
  "prefix": " ",
  "scale_embedding": false,
  "task_specific_params": {
    "summarization": {
      "early_stopping": true,
      "length_penalty": 2.0,
      "max_length": 142,
      "min_length": 56,
      "no_repeat_ngram_size": 3,
      "num_beams": 4
    }
  },
  "vocab_size": 50264
}
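
The two settings above that matter here, max_position_embeddings and task_specific_params, can be read programmatically instead of being copied out of the JSON; a small sketch, assuming the current facebook/bart-large-cnn model ID:

from transformers import AutoConfig

# Hard limit behind the IndexError: the size of the encoder's positional-embedding table.
config = AutoConfig.from_pretrained('facebook/bart-large-cnn')
print(config.max_position_embeddings)                # 1024

# Summarization defaults stored on the model (beam count, min/max summary length, etc.).
print(config.task_specific_params['summarization'])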

EDIT: Tagging @sshleifer as recommended by docs

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 20 (5 by maintainers)

Top GitHub Comments

4 reactions
sshleifer commented, May 10, 2020

@pwschaedler This is a change in pipelines that we may or may not undo. Previously, the tokenizer truncated your long documents to their beginnings. In the meantime, you can use this code on the latest transformers:

from transformers import BartForConditionalGeneration, BartTokenizer
from typing import List

def old_summarization_pipeline(text: List[str]) -> List[str]:
    tokenizer = BartTokenizer.from_pretrained('bart-large-cnn')
    model = BartForConditionalGeneration.from_pretrained('bart-large-cnn')
    input_ids = tokenizer.batch_encode_plus(text, return_tensors='pt', max_length=1024)['input_ids']
    summary_ids = model.generate(input_ids)
    summaries = [tokenizer.decode(s) for s in summary_ids]
    return summaries

text = '=' * 10257
old_summarization_pipeline([text])

3 reactions
constantin-huetterer commented, May 28, 2021

Hi @ig-perez, I realize this reply comes a little late to your question, but maybe it can still help you or someone else out. Here is the code from @sshleifer with some modifications to make it work for the current version.

from typing import List
from transformers import BartForConditionalGeneration, BartTokenizer

def old_summarization_pipeline(text: List[str]) -> List[str]:
    tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
    model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
    input_ids = tokenizer.batch_encode_plus(text, truncation=True, padding=True, return_tensors='pt', max_length=1024)['input_ids']
    summary_ids = model.generate(input_ids)
    summaries = [tokenizer.decode(s, skip_special_tokens=True, clean_up_tokenization_spaces=False) for s in summary_ids]
    return summaries

print(old_summarization_pipeline([ARTICLE_TO_SUMMARIZE, ARTICLE_TO_SUMMARIZE_2, ARTICLE_TO_SUMMARIZE_2 * 400]))

I tried it with:

  • transformers=4.4.2
  • pytorch=1.8.0=py3.8_cuda10.2_cudnn7.6.5_0
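
On recent 4.x releases the same truncate-to-the-first-1024-tokens behaviour can also be requested from the pipeline itself, without dropping down to the tokenizer and model directly. A minimal sketch (not from the thread; exact keyword support can vary between versions):

from transformers import pipeline

summarizer = pipeline('summarization', model='facebook/bart-large-cnn')
long_text = '=' * 10257   # stand-in for a document longer than 1024 tokens

# truncation=True asks the pipeline's tokenizer to clip the input to the model's
# maximum length (1024 for bart-large-cnn) instead of raising an IndexError.
print(summarizer(long_text, truncation=True, max_length=142, min_length=56))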