Bart now enforces maximum sequence length in Summarization Pipeline
🐛 Bug
Information
Model I am using (Bert, XLNet …): Bart (bart-large-cnn)
Language I am using the model on (English, Chinese …): English
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
My script is based on the example code in the docs, though.
The task I am working on is:
- an official GLUE/SQuAD task: (give the name)
- my own task or dataset: (give details below)
To reproduce
Steps to reproduce the behavior:
- Load default summarization pipeline
- Try to use the model to summarize text that has > 1024 tokens
Example code:
from transformers import pipeline
summarizer = pipeline('summarization')
text = '=' * 102570  # happened to be the length of the file I was testing; my actual file produced 25,257 tokens
print(summarizer(text, max_length=250))
Output:
Token indices sequence length is longer than the specified maximum sequence length for this model (1605 > 1024). Running this sequence through the model will result in indexing errors
Traceback (most recent call last):
File "ex.py", line 4, in <module>
print(summarizer(text, max_length=250))
File ".../lib/python3.7/site-packages/transformers/pipelines.py", line 1330, in __call__
inputs["input_ids"], attention_mask=inputs["attention_mask"], **generate_kwargs,
File ".../lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
return func(*args, **kwargs)
File ".../lib/python3.7/site-packages/transformers/modeling_utils.py", line 1047, in generate
encoder_outputs: tuple = encoder(input_ids, attention_mask=attention_mask)
File ".../lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File ".../lib/python3.7/site-packages/transformers/modeling_bart.py", line 292, in forward
embed_pos = self.embed_positions(input_ids)
File ".../lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File ".../lib/python3.7/site-packages/transformers/modeling_bart.py", line 763, in forward
return super().forward(positions)
File ".../lib/python3.7/site-packages/torch/nn/modules/sparse.py", line 114, in forward
self.norm_type, self.scale_grad_by_freq, self.sparse)
File ".../lib/python3.7/site-packages/torch/nn/functional.py", line 1724, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self
Expected behavior
As of last week (week of 4/26/2020) this caused no issue. Today (5/7/2020) I ran the exact same code; a new model was downloaded (no change to the transformers module, just the model itself), and it now enforces a token limit.
Expected behavior is to summarize the document regardless of its size.
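For context, the 1024-token ceiling in the warning comes from the model's learned position embeddings (max_position_embeddings in the config below), so any input longer than that indexes past the embedding table. A quick way to confirm the limit, sketched here assuming a recent transformers release:

from transformers import AutoConfig, AutoTokenizer

# Both values should come back as 1024 for this checkpoint
config = AutoConfig.from_pretrained("facebook/bart-large-cnn")
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
print(config.max_position_embeddings)
print(tokenizer.model_max_length)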
Environment info
- transformers version: 2.8.0 (also occurs in 2.9.0)
- Platform: both macOS 10.15.4 and Windows 10
- Python version: 3.7.5 (Mac) and 3.6.3/Anaconda (Windows)
- PyTorch version (GPU?): 1.5.0, no GPU
- Tensorflow version (GPU?): n/a
- Using GPU in script?: no
- Using distributed or parallel set-up in script?: no
- Model (from associated JSON file downloaded):
{"url": "https://s3.amazonaws.com/models.huggingface.co/bert/facebook/bart-large-cnn/pytorch_model.bin", "etag": "\"6eeacfe81d9304a6c5015424912f8df8\""}
- Model config:
{
"_num_labels": 3,
"activation_dropout": 0.0,
"activation_function": "gelu",
"add_final_layer_norm": false,
"attention_dropout": 0.0,
"bos_token_id": 0,
"classif_dropout": 0.0,
"d_model": 1024,
"decoder_attention_heads": 16,
"decoder_ffn_dim": 4096,
"decoder_layerdrop": 0.0,
"decoder_layers": 12,
"decoder_start_token_id": 2,
"dropout": 0.1,
"early_stopping": true,
"encoder_attention_heads": 16,
"encoder_ffn_dim": 4096,
"encoder_layerdrop": 0.0,
"encoder_layers": 12,
"eos_token_id": 2,
"id2label": {
"0": "LABEL_0",
"1": "LABEL_1",
"2": "LABEL_2"
},
"init_std": 0.02,
"is_encoder_decoder": true,
"label2id": {
"LABEL_0": 0,
"LABEL_1": 1,
"LABEL_2": 2
},
"length_penalty": 2.0,
"max_length": 142,
"max_position_embeddings": 1024,
"min_length": 56,
"model_type": "bart",
"no_repeat_ngram_size": 3,
"normalize_before": false,
"num_beams": 4,
"num_hidden_layers": 12,
"output_past": true,
"pad_token_id": 1,
"prefix": " ",
"scale_embedding": false,
"task_specific_params": {
"summarization": {
"early_stopping": true,
"length_penalty": 2.0,
"max_length": 142,
"min_length": 56,
"no_repeat_ngram_size": 3,
"num_beams": 4
}
},
"vocab_size": 50264
}
EDIT: Tagging @sshleifer as recommended by the docs.
Top GitHub Comments
@pwschaedler This is a change in pipelines that we may or may not undo. Previously, the tokenizer truncated your long documents to their beginnings. In the meantime, you can use this code on the latest transformers:
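The snippet itself isn't reproduced above, so what follows is only a rough sketch of the same idea: manually clipping the input to the model's 1024-token limit before generating, assuming a recent transformers release and the bart-large-cnn checkpoint (the long_text variable is a placeholder).

from transformers import BartForConditionalGeneration, BartTokenizer

model_name = "facebook/bart-large-cnn"
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

long_text = "Some very long article text. " * 2000  # stand-in for a real document

# Clip the input to the model's maximum input length (1024 tokens for BART)
inputs = tokenizer(long_text, max_length=1024, truncation=True, return_tensors="pt")

# Generation settings mirror the summarization defaults in the config above
summary_ids = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    num_beams=4,
    length_penalty=2.0,
    max_length=142,
    min_length=56,
    no_repeat_ngram_size=3,
    early_stopping=True,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))

Truncating this way simply drops everything past the first 1024 tokens, which matches the earlier pipeline behaviour of summarizing only the beginning of a long document.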
Hi @ig-perez, I realize this reply comes a little late to your question, but maybe it can still help you or someone else out. Here is the code from @sshleifer with some modifications to make it work for the current version.
I tried it with:
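The test input is not shown here; purely as an illustration, on recent transformers releases the pipeline itself accepts a truncation flag, so a comparable call (again with a placeholder long_text) might look like:

from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
long_text = "Some very long article text. " * 2000  # stand-in for a real document

# truncation=True clips the input at the model's 1024-token limit
# instead of raising the IndexError shown above
print(summarizer(long_text, truncation=True, max_length=142, min_length=56))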