A question on keyphrases that are subsets of others and overlapping `Spans`
I think the current implementation returns keyphrases that are potential subsets of each other, that this is due to the use of `noun_chunks` and `ents`, and that this is not the desired output. Specifically, if a document has an entity that is a superset of a noun chunk (or vice versa), as far as span start and end are concerned, and both contain a key token, then both will be returned as keyphrases.
While this may also be linked to the issue of entity linkage (which I’d love to know more about!), it can simply be a matter of defining “entity” boundaries and a duplication issue, as the example below with “Seoul’s Four Seasons hotel” and “Four Seasons” demonstrates: I believe one keyphrase is enough, and having both is confusing.
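To be concrete, “superset” here is purely positional, in terms of token offsets. A minimal sketch of the containment test (the helper name is my own, not part of pytextrank or spaCy):

```python
from spacy.tokens import Span

def span_contains(outer: Span, inner: Span) -> bool:
    """True if `outer` fully covers `inner` by token offsets,
    e.g. "Seoul's Four Seasons hotel" covering "Four Seasons"."""
    return outer.start <= inner.start and inner.end <= outer.end
```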
Am I missing something? Is this the desired logic?
Example:
```python
from spacy.util import filter_spans
import pytextrank
import en_core_web_sm

nlp = en_core_web_sm.load()
nlp.add_pipe("textrank", last=True)

# from dat/lee.txt
text = """
After more than four hours of tight play and a rapid-fire endgame, Google's artificially intelligent Go-playing computer system has won a second contest against grandmaster Lee Sedol, taking a two-games-to-none lead in their historic best-of-five match in downtown Seoul. The surprisingly skillful Google machine, known as AlphaGo, now needs only one more win to claim victory in the match. The Korean-born Lee Sedol will go down in defeat unless he takes each of the match's last three games. Though machines have beaten the best humans at chess, checkers, Othello, Scrabble, Jeopardy!, and so many other games considered tests of human intellect, they have never beaten the very best at Go. Game Three is set for Saturday afternoon inside Seoul's Four Seasons hotel. The match is a way of judging the suddenly rapid progress of artificial intelligence. One of the machine-learning techniques at the heart of AlphaGo has already reinvented myriad online services inside Google and other big-name Internet companies, helping to identify images, recognize commands spoken into smartphones, improve search engine results, and more. Meanwhile, another AlphaGo technique is now driving experimental robotics at Google and places like the University of California at Berkeley. This week's match can show how far these technologies have come - and perhaps how far they will go. Created in Asia over 2,500 year ago, Go is exponentially more complex than chess, and at least among humans, it requires an added degree of intuition. Lee Sedol is widely-regarded as the top Go player of the last decade, after winning more international titles than all but one other player. He is currently ranked number five in the world, and according to Demis Hassabis, who leads DeepMind, the Google AI lab that created AlphaGo, his team chose the Korean for this all-important match because they wanted an opponent who would be remembered as one of history's great players. Although AlphaGo topped Lee Sedol in the match's first game on Wednesday afternoon, the outcome of Game Two was no easier to predict. In his 1996 match with IBM's Deep Blue supercomputer, world chess champion Gary Kasparov lost the first game but then came back to win the second game and, eventually, the match as a whole. It wasn't until the following year that Deep Blue topped Kasparov over the course of a six-game contest. The thing to realize is that, after playing AlphaGo for the first time on Wednesday, Lee Sedol could adjust his style of play - just as Kasparov did back in 1996. But AlphaGo could not. Because this Google creation relies so heavily on machine learning techniques, the DeepMind team needs a good four to six weeks to train a new incarnation of the system. And that means they can't really change things during this eight-day match. "This is about teaching and learning," Hassabis told us just before Game Two. "One game is not enough data to learn from - for a machine - and training takes an awful lot of time."
"""
doc = nlp(text)

# collect every Span behind every ranked keyphrase
key_spans = []
for phrase in doc._.phrases:
    for chunk in phrase.chunks:
        key_spans.append(chunk)

print(len(key_spans))

full_set = set([p.text for p in doc._.phrases])
print(full_set)

# filter_spans() drops overlapping spans, keeping the (first) longest
print(len(filter_spans(key_spans)))

sub_set = set([pytextrank.util.default_scrubber(p) for p in filter_spans(key_spans)])
print(sub_set)

# phrases lost / gained by the filtering
print(full_set - sub_set)
print(sub_set - full_set)
```
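As a quick follow-up check (my addition, not part of the original script), one can verify whether each phrase dropped by `filter_spans` is a substring of some retained phrase, i.e. a subset keyphrase:

```python
# Every dropped phrase should be covered by a longer kept phrase,
# e.g. "four seasons" inside "seoul's four seasons hotel".
for dropped in full_set - sub_set:
    covering = [kept for kept in sub_set if dropped in kept]
    print(f"{dropped!r} covered by: {covering}")
```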
Possible solution?:
```python
all_spans = list(self.doc.noun_chunks) + list(self.doc.ents)
filtered_spans = filter_spans(all_spans)
filtered_phrases = self._collect_phrases(filtered_spans, self.ranks)  # replacing all_phrases
```
instead of
```python
nc_phrases: typing.Dict[Span, float] = self._collect_phrases(self.doc.noun_chunks, self.ranks)
ent_phrases: typing.Dict[Span, float] = self._collect_phrases(self.doc.ents, self.ranks)
all_phrases: typing.Dict[Span, float] = { **nc_phrases, **ent_phrases }
```
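The `{ **nc_phrases, **ent_phrases }` merge keeps both a noun chunk and an entity that overlap, since `Span` keys with different boundaries never collide in the dict. A minimal sketch for listing such overlaps on the `doc` from the example above (the comprehension is mine, purely illustrative):

```python
# Pairs of noun chunks and entities that share tokens but differ in extent,
# e.g. "Seoul's Four Seasons hotel" vs. "Four Seasons".
overlapping = [
    (nc.text, ent.text)
    for nc in doc.noun_chunks
    for ent in doc.ents
    if nc.start < ent.end and ent.start < nc.end
    and (nc.start, nc.end) != (ent.start, ent.end)
]
print(overlapping)
```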
Note:
- My understanding is that `self._get_min_phrases` is doing something else. `spacy.util.filter_spans` simply keeps the (first) longest span among overlapping candidates, which might not be the best solution; see the demo below.
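To make that “(first) longest” behaviour concrete, here is a small self-contained demo of `spacy.util.filter_spans` on a toy doc of my own:

```python
import spacy
from spacy.util import filter_spans

nlp = spacy.blank("en")
doc = nlp("Seoul's Four Seasons hotel")

full = doc[0:5]   # Seoul's Four Seasons hotel
inner = doc[2:4]  # Four Seasons

# filter_spans() sorts candidates longest-first (ties broken by start offset)
# and greedily keeps spans that do not overlap an already-kept span.
print(filter_spans([inner, full]))  # -> [Seoul's Four Seasons hotel]
```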
Top GitHub Comments
Hi @DayalStrub, quite an interesting use case. I think we can consider this issue resolved.

@ceteri
Sorry for the late reply. It’s been a busy week!
I’ve added a few examples of scrubbing (including yours) in #197.
Re. the use case: we are really just experimenting. We get a load of documents of all types (Word, emails, PowerPoints, etc.) and have a way for people to search and inspect them, so we were looking into providing keyphrases as metadata, so the users can…
We don’t yet have feedback from the users as to whether they are finding it useful, though.