Setting the language through Config() doesn't load language-specific stopwords for Article keywords

Hi.

I’ve found a bug: setting the language through the config passed to newspaper.build() does not load the stopwords for that language. The nlp keywords() function is hard-coded to load a single stopwords file, like this:

with open(settings.NLP_STOPWORDS_EN, 'r') as f:
    stopwords = set([w.strip() for w in f.readlines()])

And the function:

def keywords(text):
    """Get the top 10 keywords and their frequency scores ignores blacklisted
    words in stopwords, counts the number of occurrences of each word, and
    sorts them in reverse natural order (so descending) by number of
    occurrences.
    """
    NUM_KEYWORDS = 10
    text = split_words(text)
    # of words before removing blacklist words
    if text:
        num_words = len(text)
        text = [x for x in text if x not in stopwords]
        freq = {}
        for word in text:
            if word in freq:
                freq[word] += 1
            else:
                freq[word] = 1

        min_size = min(NUM_KEYWORDS, len(freq))
        keywords = sorted(freq.items(),
                          key=lambda x: (x[1], x[0]),
                          reverse=True)
        keywords = keywords[:min_size]
        keywords = dict((x, y) for x, y in keywords)

        for k in keywords:
            articleScore = keywords[k]*1.0 / max(num_words, 1)
            keywords[k] = articleScore * 1.5 + 1
        return dict(keywords)
    else:
        return dict()

So it makes no difference which language you set in the Config class, because nlp() always calls the function above, which uses the English NLP stopwords file (it lives in the misc/ directory and is different from the one in the text/ directory).

Could you explain whether this is intentional, or whether something is really wrong or still to be implemented for language-specific keyword extraction?
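For illustration, here is a minimal sketch of what the requested behaviour could look like: the same counting and scoring as the function quoted above, but with the stopword set passed in by the caller instead of being read from a fixed English file. This is not the library's code. The `stopwords` parameter, the simplified `split_words` stand-in, and the tiny `ES_STOPWORDS` / `EN_STOPWORDS` sets are assumptions made up for the example.

```python
import re
from collections import Counter


def split_words(text):
    # Simplified stand-in for the library's split_words helper (assumption):
    # lowercase the text and split it into word tokens.
    return re.findall(r"\w+", text.lower())


def keywords(text, stopwords, num_keywords=10):
    """Score the most frequent non-stopword tokens in `text`.

    Same scoring as the quoted function, but the stopword set is an
    argument (hypothetical signature), so the caller picks the language.
    """
    words = split_words(text)
    if not words:
        return {}
    num_words = len(words)  # counted before stopwords are removed
    freq = Counter(w for w in words if w not in stopwords)
    # Sort by (count, word) descending, as in the quoted function.
    top = sorted(freq.items(), key=lambda x: (x[1], x[0]), reverse=True)
    top = top[:num_keywords]
    return {word: count * 1.0 / max(num_words, 1) * 1.5 + 1
            for word, count in top}


# Tiny illustrative stopword sets, not the library's real lists.
ES_STOPWORDS = {"el", "la", "de", "y", "en"}
EN_STOPWORDS = {"the", "of", "and", "in"}

text = "el gato y la casa de el gato"
print(keywords(text, ES_STOPWORDS))  # Spanish function words are filtered out
print(keywords(text, EN_STOPWORDS))  # "el", "la", "de" survive as "keywords"
```

With the English set, Spanish function words such as "el" end up among the keywords, which is the symptom described in this issue.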

Issue Analytics

Created 9 years ago
Comments: 6 (1 by maintainers)

Top GitHub Comments

yprez commented, Mar 14, 2016

# NLP stopwords are != regular stopwords for now...
NLP_STOPWORDS_EN = os.path.join(
    PARENT_DIRECTORY, 'resources/misc/stopwords-nlp-en.txt')

(https://github.com/codelucas/newspaper/blob/41b930b467979577710b86ecb93c2a952e5c9a0d/newspaper/settings.py#L28)

@codelucas do you remember why nlp.py uses a different stopwords file?

codelucas commented, Oct 22, 2017

Hi all, thanks for the error reporting and effort on this! @minuscorp @raspooti @Brandl @yprez @bartvanremortele @Cabu.

This has finally been fixed in #438. There is no reason for nlp to default to en stopwords while parse uses the proper language-specific stopwords file; it was an implementation bug. However, I recall splitting the stopwords files for nlp and parse for performance reasons: keyword extraction was poor when using the normal stopwords file.
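As a rough sketch of the idea in this comment (and not the actual change made in #438), a loader could keep the dedicated NLP list for English and fall back to the regular per-language list for everything else. The function names, the `resources_dir` parameter, and the `text/stopwords-<lang>.txt` naming are assumptions for illustration; only `misc/stopwords-nlp-en.txt` and the existence of a text/ directory come from this thread.

```python
import os
import tempfile


def stopwords_path(language, resources_dir):
    """Choose a stopwords file for keyword extraction (hypothetical helper)."""
    if language == "en":
        # English keeps its separate NLP list, as described in the thread.
        return os.path.join(resources_dir, "misc", "stopwords-nlp-en.txt")
    # Assumed naming for the regular per-language lists.
    return os.path.join(resources_dir, "text", "stopwords-%s.txt" % language)


def load_stopwords(language, resources_dir):
    """Read the chosen file into a set, skipping blank lines."""
    with open(stopwords_path(language, resources_dir), encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def make_demo_resources(root):
    """Write two tiny stopword files so the example is self-contained."""
    files = {
        ("misc", "stopwords-nlp-en.txt"): "the\nof\n",
        ("text", "stopwords-es.txt"): "el\nla\n",
    }
    for parts, content in files.items():
        path = os.path.join(root, *parts)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w", encoding="utf-8") as f:
            f.write(content)


with tempfile.TemporaryDirectory() as root:
    make_demo_resources(root)
    print(load_stopwords("en", root))
    print(load_stopwords("es", root))
```

Resolving the path from the language keeps the English special case in one place, so the separate NLP list can stay without forcing every other language onto it.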