More detailed instructions needed for making (non-English) stop word lists compatible
This issue relates to #10735 re: the improvement of stop word lists.
As stated in the previous discussion, more detailed documentation on making custom stop-word lists compatible would be more helpful than an updated built-in stop-word list, since use cases even within English can be very specific.
One of my corpora is a collection of Hiberno-English letters that uses many outdated word forms as well as words derived from Irish Gaelic. The .txt files also contain occasional UTF-8 errors and orphaned XML tags inherited from the original documents.
While I acknowledge that removing "lb" for unresolved line-break tags may best be done in pre-processing, I am also having issues with word forms such as "we'll", "won't" or "'tis", which use apostrophes. In addition, abbreviated words such as "oct" for "October" seem to be problematic.
I have changed my stop-word list multiple times to include word forms both with and without apostrophes, but although both "October" and "Oct" are on my list, the token "oct" is still not being removed.
Here is a sample script that might help to update the documentation:
```python
import numpy as np
from sklearn import decomposition
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Tirconaill tír st sráid Oct 30th 22 A Chait A chara Dhílis 20th I you your me mine faith faithful faithfully ye get got we'll I'll 'tis le length legacy tá tú bhí sé sí & amp am"]

my_stopwords = ["'tis", "tis", "a", "amp", "30th", "Oct", "22", "ye", "I", "you",
                "20th", "me", "get", "st", "sráid", "we'll", "ll", "le", "tú",
                "bhí", "faithfully", "tír"]

# Quick check of the document-term matrix with the custom stop-word list
print(CountVectorizer(stop_words=my_stopwords).fit_transform(docs).A)

# Build the document-term matrix and inspect the resulting vocabulary
vectorizer = CountVectorizer(stop_words=my_stopwords)
dtm = vectorizer.fit_transform(docs).toarray()
vocab = np.array(vectorizer.get_feature_names())  # on scikit-learn >= 1.0 use get_feature_names_out()
print(dtm.shape)
print(vocab)
print(len(vocab))

# Simple NMF topic model on the document-term matrix
num_topics = 2
num_top_words = 15
clf = decomposition.NMF(n_components=num_topics, random_state=1)
doctopic = clf.fit_transform(dtm)

# Collect the top words per topic
topic_words = []
for topic in clf.components_:
    word_idx = np.argsort(topic)[::-1][0:num_top_words]
    topic_words.append([vocab[i] for i in word_idx])
print(topic_words)
```
The output I got was this:
```
[[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]]
(1, 17)
['am' 'chait' 'chara' 'dhílis' 'faith' 'faithful' 'got' 'legacy' 'length'
 'mine' 'oct' 'sé' 'sí' 'tirconaill' 'tá' 'we' 'your']
17
[['chait', 'mine', 'sé', 'tá', 'got', 'sí', 'chara', 'length', 'we', 'tirconaill', 'your', 'am', 'legacy', 'faithful', 'oct'],
 ['faith', 'dhílis', 'oct', 'faithful', 'legacy', 'am', 'your', 'tirconaill', 'we', 'length', 'chara', 'sí', 'got', 'tá', 'sé']]

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py:300: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['oct', 'we'] not in stop_words.
  'stop_words.' % sorted(inconsistent))
```
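For reference, the consistency check behind that warning can be reproduced by hand: run the vectorizer's analyzer over each stop word and compare the resulting tokens with the list itself. A minimal sketch, assuming the same `my_stopwords` list as in the script above:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Run the default analyzer (lowercasing + token_pattern) over each stop word
# and report any resulting tokens that are not themselves on the list.
analyzer = CountVectorizer().build_analyzer()
for sw in my_stopwords:
    tokens = analyzer(sw)
    missing = [t for t in tokens if t not in my_stopwords]
    if missing:
        print(sw, "->", tokens, "missing:", missing)

# "Oct" is analyzed to ["oct"] and "we'll" to ["we", "ll"], so the lowercase
# "oct" and the token "we" would need to be on the stop-word list themselves.
```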
Top GitHub Comments
I think the main issue you're running up against is that stop words are matched after normalization, so they need to be lowercase. That probably needs to be pointed out more explicitly in the docs, since `lowercase=True` by default.

There are generally two issues here: the lowercasing just mentioned, and the tokenization of contractions such as `won't`. The regexp from https://github.com/scikit-learn/scikit-learn/pull/7008 would indeed work, however it has wider concerns, see https://github.com/scikit-learn/scikit-learn/issues/6892#issuecomment-233162541. For any advanced NLP it's likely better to use a specialized tokenizer package. You can use one with scikit-learn by passing the `tokenizer` parameter to `CountVectorizer`. In particular, "won't" is classically tokenized as `['wo', "n't"]`, I think. One could do that with a regex, but with just one regex the limit on handling all special cases is reached pretty fast, so using a specialized tokenizer package is probably better.
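To make that concrete, here is a minimal sketch of that approach; NLTK's `word_tokenize` (which needs the `punkt` data downloaded) is just one example of such a tokenizer, and the shortened stop-word list and toy document are illustrative only:

```python
from nltk.tokenize import word_tokenize  # assumes nltk and its 'punkt' data are installed
from sklearn.feature_extraction.text import CountVectorizer

# With the default lowercase=True, stop words must be lowercase, and they must
# match the tokens the chosen tokenizer actually produces: NLTK's Treebank-style
# tokenizer splits "won't" into "wo" + "n't" and "we'll" into "we" + "'ll".
my_stopwords = ["oct", "30th", "22", "ye", "i", "you", "me", "we", "wo", "n't", "'ll"]

vectorizer = CountVectorizer(tokenizer=word_tokenize, stop_words=my_stopwords)

docs = ["Oct 30th we'll won't faith faithful faithfully legacy"]
dtm = vectorizer.fit_transform(docs)
print(sorted(vectorizer.vocabulary_))  # expected: ['faith', 'faithful', 'faithfully', 'legacy']
```

Any callable that takes a string and returns a list of token strings can be passed as `tokenizer`, so the same pattern works with spaCy or a custom regex-based function; the key point is that the stop-word list has to match the tokens that callable actually emits after lowercasing.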