Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Weird summarization results - the summary is longer than the input

See original GitHub issue

🐛 Bug

Information

The summarization task returns unexpected results. For the input

“We have a telephony partner who is very interested in this program and may be able to help identify pilot customers.”

The result is

[{'summary_text': '"We have a telephony partner who is very interested in this program and may be able to help identify pilot customers," the company says. "We are looking at a number of different ways to get people talking to each other," it adds. "It\'s a very exciting time for us," says the company\'s chief operating officer.'}]

Model I am using (Bert, XLNet …): Summarization pipeline

Language I am using the model on (English, Chinese …): English

The problem arises when using:

  • the official example scripts: (give details below)
  • [x] my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: (give the name)
  • [x] my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

  1. Execute the script below:
!pip install -q transformers --upgrade
from transformers import pipeline
summarizer = pipeline(task="summarization")

data = "We have a telephony partner who is very interested in this program and may be able to help identify pilot customers."
print(summarizer(data))

Expected behavior

I would expect the summary 1) not to add contextual information that doesn't exist in the input, and 2) not to be longer than the input. Arguably the input is short, but still…

Environment info

Colab

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

3 reactions
metahgva commented, Apr 6, 2020

The logic of the program is "generate the most likely summary" of between min_length and max_length. So it's not programmed to cut the summary in a rules-based way. Thanks for confirming - seems to be the right approach 😃!
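Since the pipeline generates between min_length and max_length tokens regardless of how short the input is, one workaround is to cap max_length at (roughly) the input's own length. A minimal sketch, assuming a Hugging Face summarization pipeline that accepts min_length/max_length; the helper name and the whitespace-based token estimate are my own, not part of the library:

```python
# Sketch: keep the summary no longer than the input by capping max_length.
# Assumptions: `summarizer` behaves like a Hugging Face summarization
# pipeline accepting min_length/max_length keyword arguments; a whitespace
# word count is a rough stand-in for the real tokenizer's token count.

def summarize_capped(text, summarizer, min_length=5):
    n_words = len(text.split())                # rough token estimate
    max_length = max(min_length + 1, n_words)  # never drop below min_length
    return summarizer(text, min_length=min_length, max_length=max_length)
```

For example, summarize_capped(data, pipeline(task="summarization")). For inputs this short, skipping summarization entirely may be the more sensible choice.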

You might get better results with summarizer = pipeline(task="summarization", model="bart-large-xsum"). Ok, will give it a try then!

With that in mind, I've also seen poor results when summarizing documents that are very different from the finetuning distribution (news articles of ~1024 tokens). So do you want to keep it open as a bug, or should we close it?

As a side request, it would be awesome to have metrics associated with each model that is part of transformers, to help users choose the right one for their job (cc: @julien-c).

1 reaction
sshleifer commented, May 10, 2020

Unfortunately, Bart can only process 1024 tokens at once, so your best bet would be to split your doc into chunks, summarize each one, and concatenate the summaries.
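The chunk-and-concatenate approach above can be sketched as follows. The helper names and the 800-word budget are my own assumptions: words are not tokens, so a conservative word limit is used here to stay under Bart's 1024-token maximum.

```python
# Sketch: split a long document into chunks under the model's limit,
# summarize each chunk, and join the partial summaries.
# Assumption: whitespace word count approximates token count; 800 words
# leaves headroom under Bart's 1024-token maximum.

def chunk_text(text, max_words=800):
    """Split text into pieces of at most max_words whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

def summarize_long(text, summarizer, max_words=800):
    """Summarize each chunk independently and concatenate the results."""
    parts = [summarizer(chunk)[0]["summary_text"]
             for chunk in chunk_text(text, max_words)]
    return " ".join(parts)
```

Note that chunking at arbitrary word boundaries can split sentences mid-thought; splitting on sentence or paragraph boundaries instead would likely give more coherent partial summaries.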

Read more comments on GitHub >

Top Results From Across the Web

Summarization - Hugging Face Course
In this section we'll take a look at how Transformer models can be used to condense long documents into summaries, a task known...
Read more >
Unexpected results from Summarize Within - Esri Community
Solved: I've created a walk-time map in ArcGIS Online, based on library locations within a city, using 5-minute increments between 0-5 mins ...
Read more >
Abstractive Text Summarization - Medium
With this setting, the model is able to selectively focus on useful parts of the input sequence and hence, learn the alignment between...
Read more >
Text Summarization - an overview | ScienceDirect Topics
Google makes a short text summarization of the most important item and places the summary at the head of the list of search...
Read more >
A Gentle Introduction to Text Summarization
Text summarization is the problem of creating a short, accurate, and fluent summary of a longer text document. Automatic text summarization ...
Read more >
