Bart now enforces maximum sequence length in Summarization Pipeline


🐛 Bug

Information

Model I am using (Bert, XLNet …): Bart (bart-large-cnn)

Language I am using the model on (English, Chinese …): English

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

Based on example code in docs, though.

The task I am working on is:

  • an official GLUE/SQuAD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

  1. Load default summarization pipeline
  2. Try to use model to summarize text that has > 1024 tokens

Example code:

from transformers import pipeline
summarizer = pipeline('summarization')
text = '=' * 102570    # Happened to be the length of the file I was testing, my actual file produced 25,257 tokens
print(summarizer(text, max_length=250))

Output:

Token indices sequence length is longer than the specified maximum sequence length for this model (1605 > 1024). Running this sequence through the model will result in indexing errors
Traceback (most recent call last):
  File "ex.py", line 4, in <module>
    print(summarizer(text, max_length=250))
  File ".../lib/python3.7/site-packages/transformers/pipelines.py", line 1330, in __call__
    inputs["input_ids"], attention_mask=inputs["attention_mask"], **generate_kwargs,
  File ".../lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
    return func(*args, **kwargs)
  File ".../lib/python3.7/site-packages/transformers/modeling_utils.py", line 1047, in generate
    encoder_outputs: tuple = encoder(input_ids, attention_mask=attention_mask)
  File ".../lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File ".../lib/python3.7/site-packages/transformers/modeling_bart.py", line 292, in forward
    embed_pos = self.embed_positions(input_ids)
  File ".../lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File ".../lib/python3.7/site-packages/transformers/modeling_bart.py", line 763, in forward
    return super().forward(positions)
  File ".../lib/python3.7/site-packages/torch/nn/modules/sparse.py", line 114, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File ".../lib/python3.7/site-packages/torch/nn/functional.py", line 1724, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self
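
The IndexError at the bottom of the trace comes from the encoder's learned positional-embedding table: for bart-large-cnn it has max_position_embeddings = 1024 rows, so any token position past 1024 indexes out of range. A minimal pre-check sketch (not part of the original report; it assumes the current facebook/bart-large-cnn model ID and clips the input to the first 1024 tokens, the way the old pipeline did):

from transformers import BartTokenizer

# Hypothetical pre-check, assuming facebook/bart-large-cnn: count tokens and clip the
# input to the model's window before handing it to the summarization pipeline.
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
text = '=' * 102570
ids = tokenizer.encode(text)                       # token ids, special tokens included
if len(ids) > tokenizer.model_max_length:          # model_max_length is 1024 here
    ids = ids[:tokenizer.model_max_length]
    text = tokenizer.decode(ids, skip_special_tokens=True)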

Expected behavior

As of last week (the week of 4/26/2020) this caused no issue. Today (5/7/2020) I ran the exact same code; a new model was downloaded (no change to the transformers module, just the model itself), and it now enforces a token limit.

Expected behavior is to summarize the document regardless of its size.

Environment info

  • transformers version: 2.8.0 (also occurs in 2.9.0)
  • Platform: Both macOS 10.15.4 and Windows 10
  • Python version: 3.7.5 (Mac) and 3.6.3/Anaconda (Windows)
  • PyTorch version (GPU?): 1.5.0, no GPU
  • Tensorflow version (GPU?): n/a
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no
  • Model (from associated JSON file downloaded): {"url": "https://s3.amazonaws.com/models.huggingface.co/bert/facebook/bart-large-cnn/pytorch_model.bin", "etag": "\"6eeacfe81d9304a6c5015424912f8df8\""}
  • Model config:
{
  "_num_labels": 3,
  "activation_dropout": 0.0,
  "activation_function": "gelu",
  "add_final_layer_norm": false,
  "attention_dropout": 0.0,
  "bos_token_id": 0,
  "classif_dropout": 0.0,
  "d_model": 1024,
  "decoder_attention_heads": 16,
  "decoder_ffn_dim": 4096,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 12,
  "decoder_start_token_id": 2,
  "dropout": 0.1,
  "early_stopping": true,
  "encoder_attention_heads": 16,
  "encoder_ffn_dim": 4096,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 12,
  "eos_token_id": 2,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "init_std": 0.02,
  "is_encoder_decoder": true,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "length_penalty": 2.0,
  "max_length": 142,
  "max_position_embeddings": 1024,
  "min_length": 56,
  "model_type": "bart",
  "no_repeat_ngram_size": 3,
  "normalize_before": false,
  "num_beams": 4,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 1,
  "prefix": " ",
  "scale_embedding": false,
  "task_specific_params": {
    "summarization": {
      "early_stopping": true,
      "length_penalty": 2.0,
      "max_length": 142,
      "min_length": 56,
      "no_repeat_ngram_size": 3,
      "num_beams": 4
    }
  },
  "vocab_size": 50264
}
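
The two settings above that matter here, max_position_embeddings and task_specific_params, can be read programmatically instead of being copied out of the JSON; a small sketch, assuming the current facebook/bart-large-cnn model ID:

from transformers import AutoConfig

# Hard limit behind the IndexError: the size of the encoder's positional-embedding table.
config = AutoConfig.from_pretrained('facebook/bart-large-cnn')
print(config.max_position_embeddings)                # 1024

# Summarization defaults stored on the model (beam count, min/max summary length, etc.).
print(config.task_specific_params['summarization'])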

EDIT: Tagging @sshleifer as recommended by docs

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 20 (5 by maintainers)

Top GitHub Comments

4 reactions
sshleifer commented, May 10, 2020

@pwschaedler This is a change in pipelines that we may or may not undo. Previously, the tokenizer truncated your long documents to their beginnings. In the meantime, you can use this code on the latest transformers:

from transformers import BartForConditionalGeneration, BartTokenizer
from typing import List

def old_summarization_pipeline(text: List[str]) -> List[str]:
    tokenizer = BartTokenizer.from_pretrained('bart-large-cnn')
    model = BartForConditionalGeneration.from_pretrained('bart-large-cnn')
    input_ids = tokenizer.batch_encode_plus(text, return_tensors='pt', max_length=1024)['input_ids']
    summary_ids = model.generate(input_ids)
    summaries = [tokenizer.decode(s) for s in summary_ids]
    return summaries

text = '=' * 10257
old_summarization_pipeline([text])

3 reactions
constantin-huetterer commented, May 28, 2021

Hi @ig-perez, I realize this reply comes a little late to your question, but maybe it can still help you or someone else out. Here is the code from @sshleifer with some modifications to make it work for the current version.

from typing import List
from transformers import BartForConditionalGeneration, BartTokenizer

def old_summarization_pipeline(text: List[str]) -> List[str]:
    tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
    model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
    input_ids = tokenizer.batch_encode_plus(text, truncation=True, padding=True, return_tensors='pt', max_length=1024)['input_ids']
    summary_ids = model.generate(input_ids)
    summaries = [tokenizer.decode(s, skip_special_tokens=True, clean_up_tokenization_spaces=False) for s in summary_ids]
    return summaries

print(old_summarization_pipeline([ARTICLE_TO_SUMMARIZE, ARTICLE_TO_SUMMARIZE_2, ARTICLE_TO_SUMMARIZE_2 * 400]))

I tried it with:

  • transformers=4.4.2
  • pytorch=1.8.0=py3.8_cuda10.2_cudnn7.6.5_0
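
On recent 4.x releases the same truncate-to-the-first-1024-tokens behaviour can also be requested from the pipeline itself, without dropping down to the tokenizer and model directly. A minimal sketch (not from the thread; exact keyword support can vary between versions):

from transformers import pipeline

summarizer = pipeline('summarization', model='facebook/bart-large-cnn')
long_text = '=' * 10257   # stand-in for a document longer than 1024 tokens

# truncation=True asks the pipeline's tokenizer to clip the input to the model's
# maximum length (1024 for bart-large-cnn) instead of raising an IndexError.
print(summarizer(long_text, truncation=True, max_length=142, min_length=56))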