Ideas for visualising key phrases together with text, as a modelling aid

Just wanted to see what people thought about this…

I’ve been playing about with keyphrase extraction and, as well as looking at the altair plot PyTextRank produces, I found it helpful to display the text with the key phrases highlighted. I ended up “hacking” doc.ents and rendering with spaCy’s displacy. It’s not necessarily clean, so I’m not sure how it could be added as is, but I thought I would share it, as I think it would make a nice exploratory/modelling feature, similar to the extra viz functionality. It might be a common hack that people already know, but I haven’t seen it elsewhere.

Here is an example output:

NOTE: It is only displaying the top 10 key phrases as the colours get quite busy, but one can easily drop the colouring.

And here is the code to reproduce and play with it:

# %%
import en_core_web_sm
import pytextrank
import random
import spacy

# %%
def generate_colour():
    # Zero-pad to six hex digits so small values still give a valid "#rrggbb" colour
    random_number = random.randint(0, 0xFFFFFF)
    return f"#{random_number:06x}"

# %%
def hack_ents(doc, n_phrases=10, precision=5):
    phrases = doc._.phrases

    ## filter to top n_phrases
    if (n_phrases is not None) and len(phrases) > n_phrases:
        phrases = phrases[0:n_phrases]

    keyphrases = []
    for p in phrases:
        if p.rank > 0:
            for chunk in p.chunks:
                chunk.label_ = str(round(p.rank, precision))
                keyphrases.append(chunk)
    ## NOTE removing keyphrases that overlap
    keyphrases = spacy.util.filter_spans(keyphrases)

    doc.ents = []
    doc.ents = keyphrases
    
    return doc

# %%
nlp = en_core_web_sm.load()
nlp.add_pipe("textrank", last=True);

# %%
# from dat/lee.txt
text = """
After more than four hours of tight play and a rapid-fire endgame, Google's artificially intelligent Go-playing computer system has won a second contest against grandmaster Lee Sedol, taking a two-games-to-none lead in their historic best-of-five match in downtown Seoul.  The surprisingly skillful Google machine, known as AlphaGo, now needs only one more win to claim victory in the match. The Korean-born Lee Sedol will go down in defeat unless he takes each of the match's last three games. Though machines have beaten the best humans at chess, checkers, Othello, Scrabble, Jeopardy!, and so many other games considered tests of human intellect, they have never beaten the very best at Go. Game Three is set for Saturday afternoon inside Seoul's Four Seasons hotel.  The match is a way of judging the suddenly rapid progress of artificial intelligence. One of the machine-learning techniques at the heart of AlphaGo has already reinvented myriad online services inside Google and other big-name Internet companies, helping to identify images, recognize commands spoken into smartphones, improve search engine results, and more. Meanwhile, another AlphaGo technique is now driving experimental robotics at Google and places like the University of California at Berkeley. This week's match can show how far these technologies have come - and perhaps how far they will go.  Created in Asia over 2,500 year ago, Go is exponentially more complex than chess, and at least among humans, it requires an added degree of intuition. Lee Sedol is widely-regarded as the top Go player of the last decade, after winning more international titles than all but one other player. He is currently ranked number five in the world, and according to Demis Hassabis, who leads DeepMind, the Google AI lab that created AlphaGo, his team chose the Korean for this all-important match because they wanted an opponent who would be remembered as one of history's great players.  
Although AlphaGo topped Lee Sedol in the match's first game on Wednesday afternoon, the outcome of Game Two was no easier to predict. In his 1996 match with IBM's Deep Blue supercomputer, world chess champion Gary Kasparov lost the first game but then came back to win the second game and, eventually, the match as a whole. It wasn't until the following year that Deep Blue topped Kasparov over the course of a six-game contest. The thing to realize is that, after playing AlphaGo for the first time on Wednesday, Lee Sedol could adjust his style of play - just as Kasparov did back in 1996. But AlphaGo could not. Because this Google creation relies so heavily on machine learning techniques, the DeepMind team needs a good four to six weeks to train a new incarnation of the system. And that means they can't really change things during this eight-day match.  "This is about teaching and learning," Hassabis told us just before Game Two. "One game is not enough data to learn from - for a machine - and training takes an awful lot of time."
"""

# %%
doc = nlp(text)

# %%
doc = hack_ents(doc)

labels = [e.label_ for e in doc.ents]
colours = {label: generate_colour() for label in labels}
options = {"colors": colours}

# options = {}

spacy.displacy.render(
    doc, style="ent", options=options, page=False, jupyter=True, minify=True
)
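
If the random colours get too busy, one possible alternative is to shade a single hue by rank, so higher-ranked phrases appear darker. The sketch below is only an illustration: `rank_to_colour` and `colours_for_labels` are hypothetical helper names, not part of PyTextRank or spaCy, and it assumes the entity labels are the rounded rank strings set in `hack_ents`.

```python
def rank_to_colour(rank, max_rank, hue_rgb=(255, 140, 0)):
    """Blend white towards hue_rgb in proportion to rank / max_rank."""
    weight = 0.0 if max_rank <= 0 else max(0.0, min(1.0, rank / max_rank))
    # Keep a minimum tint so low-ranked phrases are still visibly highlighted.
    weight = 0.25 + 0.75 * weight
    channels = [round(255 + (c - 255) * weight) for c in hue_rgb]
    return "#{:02x}{:02x}{:02x}".format(*channels)


def colours_for_labels(labels):
    """Build a {label: colour} mapping from rank labels given as strings."""
    ranks = {label: float(label) for label in labels}
    max_rank = max(ranks.values(), default=0.0)
    return {label: rank_to_colour(rank, max_rank) for label, rank in ranks.items()}


# Example labels standing in for [e.label_ for e in doc.ents] (made-up values).
labels = ["0.12345", "0.06", "0.03"]
colours = colours_for_labels(labels)
options = {"colors": colours}
```

The resulting `options` dict can be passed to `spacy.displacy.render` in place of the randomly coloured one. Because the mapping is deterministic, re-running the cell gives the same picture each time.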

Top GitHub Comments

Hellisotherpeople commented, Dec 5, 2021

I am about to experiment with integrating this package into my webapp - https://huggingface.co/spaces/Hellisotherpeople/Unsupervised_Extractive_Summarization which is a (somewhat incomplete) port of my package CX_DB8 - https://github.com/Hellisotherpeople/CX_DB8

I feel compelled to link it here as I independently tackled this problem (visualizations of extractive summaries), and at the time I was unaware of this package (and I don’t think it existed quite in its current form either!). It may help or at least give inspiration.

@DayalStrub @ceteri thank you both for the hard work on this project and for making my life a LOT easier in the next few weeks. My goal is to have a webapp which hosts basically every single technique we can think of for extractive and query-focused extractive summarization. This will also need to eventually include MMR and related methods (which are implemented in KeyBERT).

tomaarsen commented, Aug 31, 2021

This seems like a wonderful way to very quickly and clearly show both what PyTextRank does and how it can be used. The produced image can work great as a graphical elevator pitch, and having another example of how PTR can be used is always welcome.

I’d be in favor of:

Including a smaller version of the above image in the README, to very quickly grab the attention of those who just happen to stumble upon this repository.
Including the code in the examples folder, either as a standalone Jupyter notebook, or included in the existing sample.ipynb.

Perhaps the README image can link directly to the Jupyter notebook. That way, developers can play around with PTR within minutes.