Set Language through Config() doesn't set language specific stopwords and Article's keyword problem.


Hi.

I’ve found a bug: when you set the language through the config passed to newspaper.build(), the stopwords list for that language is not loaded. The nlp keywords function is hard-coded to load a stopwords file like this:

with open(settings.NLP_STOPWORDS_EN, 'r') as f:
    stopwords = set([w.strip() for w in f.readlines()])

And the function:

def keywords(text):
    """Get the top 10 keywords and their frequency scores ignores blacklisted
    words in stopwords, counts the number of occurrences of each word, and
    sorts them in reverse natural order (so descending) by number of
    occurrences.
    """
    NUM_KEYWORDS = 10
    text = split_words(text)
    # of words before removing blacklist words
    if text:
        num_words = len(text)
        text = [x for x in text if x not in stopwords]
        freq = {}
        for word in text:
            if word in freq:
                freq[word] += 1
            else:
                freq[word] = 1

        min_size = min(NUM_KEYWORDS, len(freq))
        keywords = sorted(freq.items(),
                          key=lambda x: (x[1], x[0]),
                          reverse=True)
        keywords = keywords[:min_size]
        keywords = dict((x, y) for x, y in keywords)

        for k in keywords:
            articleScore = keywords[k]*1.0 / max(num_words, 1)
            keywords[k] = articleScore * 1.5 + 1
        return dict(keywords)
    else:
        return dict()

So telling the config class which language to use makes no difference, because nlp() always calls the function above, which loads the English nlp stopwords file (the one in the misc/ directory, which is different from the one in the text/ directory).
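For illustration, here is a minimal sketch of what a language-aware variant of the function above could look like. It is not the library's actual fix. The names `STOPWORDS_BY_LANGUAGE` and `load_stopwords`, the tiny in-memory stopword sets, the regex tokenizer standing in for `split_words`, and the fallback to English are all assumptions made for this example. The scoring is copied from the function quoted above.

```python
import re
from collections import Counter

# Hypothetical in-memory stopword sets, kept tiny for the example.
# The real project reads its stopwords from resource files instead.
STOPWORDS_BY_LANGUAGE = {
    "en": {"the", "a", "is", "of", "and"},
    "es": {"el", "la", "es", "de", "y", "un", "en"},
}


def load_stopwords(language):
    """Return the stopword set for `language` (assumed fallback: English)."""
    return STOPWORDS_BY_LANGUAGE.get(language, STOPWORDS_BY_LANGUAGE["en"])


def keywords(text, language="en", num_keywords=10):
    """Score the most frequent non-stopwords, using per-language stopwords."""
    # Simple regex tokenizer; a stand-in for the library's split_words().
    words = re.findall(r"\w+", text.lower())
    if not words:
        return {}
    # Word count before stopwords are removed, as in the original function.
    num_words = len(words)
    stopwords = load_stopwords(language)
    freq = Counter(w for w in words if w not in stopwords)
    top = sorted(freq.items(),
                 key=lambda x: (x[1], x[0]),
                 reverse=True)[:num_keywords]
    # Same scoring as the quoted function: relative frequency * 1.5 + 1.
    return {word: count * 1.0 / max(num_words, 1) * 1.5 + 1
            for word, count in top}


if __name__ == "__main__":
    sample = "el gato es el amigo de la casa y el gato"
    print(keywords(sample, language="en"))  # Spanish articles leak in
    print(keywords(sample, language="es"))  # Spanish stopwords are filtered
```

With the English list, Spanish function words such as "el" end up among the keywords. Passing `language="es"` filters them out, which is the behaviour the config setting would be expected to produce.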

Could you explain whether this is done on purpose, or whether something is really wrong or still has to be implemented for language-specific keyword extraction?

Issue Analytics

  • State: closed
  • Created 9 years ago
  • Comments: 6 (1 by maintainers)

Top GitHub Comments

1 reaction
yprez commented, Mar 14, 2016
# NLP stopwords are != regular stopwords for now...
NLP_STOPWORDS_EN = os.path.join(
    PARENT_DIRECTORY, 'resources/misc/stopwords-nlp-en.txt')

(https://github.com/codelucas/newspaper/blob/41b930b467979577710b86ecb93c2a952e5c9a0d/newspaper/settings.py#L28)

@codelucas do you remember why nlp.py uses a different stopwords file?

0 reactions
codelucas commented, Oct 22, 2017

Hi all, thanks for the error reporting and effort on this! @minuscorp @raspooti @Brandl @yprez @bartvanremortele @Cabu.

This has finally been fixed in #438. There is no reason for nlp to default to en stopwords while parse uses the proper language-specific stopwords file. It was an implementation bug. However, I recall splitting up the stopwords files for nlp and parse for performance reasons: keyword extraction was poor when using the normal stopwords file.
