Set Language through Config() doesn't set language specific stopwords and Article's keyword problem.
Hi.
I’ve found a bug: when you set the language in the config passed to newspaper.build(), the stopwords list for that language is not loaded for keyword extraction. The nlp keywords function is hard-coded to load the English stopwords file like this:
with open(settings.NLP_STOPWORDS_EN, 'r') as f:
    stopwords = set([w.strip() for w in f.readlines()])
And the function:
def keywords(text):
    """Get the top 10 keywords and their frequency scores ignores blacklisted
    words in stopwords, counts the number of occurrences of each word, and
    sorts them in reverse natural order (so descending) by number of
    occurrences.
    """
    NUM_KEYWORDS = 10
    text = split_words(text)
    # of words before removing blacklist words
    if text:
        num_words = len(text)
        text = [x for x in text if x not in stopwords]
        freq = {}
        for word in text:
            if word in freq:
                freq[word] += 1
            else:
                freq[word] = 1
        min_size = min(NUM_KEYWORDS, len(freq))
        keywords = sorted(freq.items(),
                          key=lambda x: (x[1], x[0]),
                          reverse=True)
        keywords = keywords[:min_size]
        keywords = dict((x, y) for x, y in keywords)
        for k in keywords:
            articleScore = keywords[k]*1.0 / max(num_words, 1)
            keywords[k] = articleScore * 1.5 + 1
        return dict(keywords)
    else:
        return dict()
So it makes no difference which language you set in the Config class, because nlp() always calls the function above, which uses the English nlp stopwords file (it lives in the misc/ directory and is different from the one in the text/ directory).
Could you explain whether this is done on purpose, or whether something is wrong or still needs to be implemented for language-specific keyword extraction?
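To illustrate the behaviour I would expect, here is a minimal, self-contained sketch that does not use newspaper itself. It keeps the same frequency scoring as the function quoted above, but takes the language as a parameter. The `STOPWORDS` dict, the `language` parameter, and the regex tokenizer (a stand-in for `split_words`) are all assumptions made for the example, not the library's API.

```python
import re
from collections import Counter

# Tiny illustrative stopword sets (hypothetical); real lists would be
# loaded from per-language files.
STOPWORDS = {
    "en": {"the", "of", "and", "in", "is"},
    "es": {"el", "la", "de", "y", "en", "es"},
}


def keywords(text, language="en", num_keywords=10):
    """Score the most frequent non-stopword tokens for a given language."""
    # Simplified tokenizer standing in for the library's split_words().
    words = re.findall(r"\w+", text.lower())
    if not words:
        return {}
    # Word count is taken before stopword removal, as in the quoted function.
    num_words = len(words)
    # Unknown languages fall back to English in this sketch.
    stopwords = STOPWORDS.get(language, STOPWORDS["en"])
    freq = Counter(w for w in words if w not in stopwords)
    # Sort by (count, word) in descending order, then keep the top entries.
    top = sorted(freq.items(), key=lambda kv: (kv[1], kv[0]), reverse=True)
    return {w: count / num_words * 1.5 + 1 for w, count in top[:num_keywords]}


text = "La casa de la ciudad es la casa de el rey"
en_result = keywords(text, language="en")  # Spanish articles survive as "keywords"
es_result = keywords(text, language="es")  # only content words remain
```

With the English list, "la" ends up as the top keyword for this Spanish sentence. With the Spanish list, only "casa", "rey" and "ciudad" remain. That is the difference I would expect the config language to make.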
Issue Analytics
- State:
- Created 9 years ago
- Comments:6 (1 by maintainers)
Top GitHub Comments
(https://github.com/codelucas/newspaper/blob/41b930b467979577710b86ecb93c2a952e5c9a0d/newspaper/settings.py#L28)
@codelucas do you remember why nlp.py uses a different stopwords file?
Hi all, thanks for the error reporting and effort on this! @minuscorp @raspooti @Brandl @yprez @bartvanremortele @Cabu.
This has finally been fixed in #438. There is no reason for nlp defaulting to en stopwords while parse uses the proper language-specific stopwords file. It was an implementation bug. However, I recall splitting up the stopwords files for nlp and parse separately for performance reasons; the keyword extraction was poor when using the normal stopwords file.
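For readers who want the general idea, one way to pick a stopwords file by language code with an English fallback is sketched below. This is not the code from #438. The `load_stopwords` helper, the `stopwords-<lang>.txt` naming scheme, and the directory argument are assumptions made for the example.

```python
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=None)
def load_stopwords(directory, language, fallback="en"):
    """Return the stopword set for `language`, or for `fallback` if missing.

    Assumes (hypothetically) one file per language in `directory`, named
    stopwords-<lang>.txt, with one word per line.
    """
    base = Path(directory)
    path = base / f"stopwords-{language}.txt"
    if not path.is_file():
        # No list for this language: fall back to the default one.
        path = base / f"stopwords-{fallback}.txt"
    with open(path, encoding="utf-8") as f:
        # Skip blank lines and strip surrounding whitespace.
        return frozenset(line.strip() for line in f if line.strip())
```

Calling `load_stopwords(some_dir, "es")` would read `stopwords-es.txt` once and serve later calls from the cache. Keeping a separate directory for the nlp lists and the parse lists would preserve the split described above.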