More detailed instructions needed for making (non-English) stop word lists compatible

This issue relates to #10735 re: the improvement of stop word lists.

As stated in the previous discussion, more detailed documentation on making custom stop word lists compatible would be more helpful than an updated built-in stop word list, as use cases even within the English language can be very specific.

One of my corpora is a collection of Hiberno-English letters using many outdated word forms as well as words derived from Irish-Gaelic. The .txt files also contain occasional UTF-8 errors and orphaned XML tags inherited from the original documents.

While I acknowledge that removing “lb” for unresolved line-break tags may best be done in pre-processing, I am also having issues with word forms such as “we’ll”, “won’t” or “'tis” which use apostrophes. In addition, abbreviated words such as “oct” for “October” seem to be problematic.

I have changed my stop word list multiple times to include word forms both with and without apostrophes, but although both “October” and “Oct” are on my list, the stop word “oct” is still being ignored (it remains in the vocabulary).

Here is a sample script that might help to update the documentation:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn import decomposition
import numpy as np

docs=["Tirconaill tír st sráid Oct 30th 22 A Chait A chara Dhílis 20th I you your me mine faith faithful faithfully ye get got we'll I'll 'tis le length legacy tá tú bhí sé sí &amp amp am"]
my_stopwords=["'tis","tis","a", "amp", "30th","Oct","22","ye","I","you","20th","me","get","st", "sráid", "we'll", "ll", "le", "tú", "bhí", "faithfully", "tír"]

# Quick check: document-term matrix with the custom stop words applied
print(CountVectorizer(stop_words=my_stopwords).fit_transform(docs).toarray())

# input='content' (the default) treats each item in docs as a string; 'docs' is not a valid value
vectorizer=CountVectorizer(input='content', stop_words=my_stopwords)

dtm=vectorizer.fit_transform(docs).toarray()

# get_feature_names_out() replaces get_feature_names() from scikit-learn 1.0 onwards
vocab=np.array(vectorizer.get_feature_names_out())

print(dtm.shape)
print(vocab)
print(len(vocab))

num_topics=2

num_top_words=15

clf=decomposition.NMF(n_components=num_topics, random_state=1)

doctopic=clf.fit_transform(dtm)

topic_words=[]
for topic in clf.components_:
    # Indices of the highest-weighted words for this topic, in descending order
    word_idx=np.argsort(topic)[::-1][0:num_top_words]
    topic_words.append([vocab[i] for i in word_idx])

print(topic_words)

The output I got was this:

[[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]]
(1, 17)
['am' 'chait' 'chara' 'dhílis' 'faith' 'faithful' 'got' 'legacy' 'length' 'mine' 'oct' 'sé' 'sí' 'tirconaill' 'tá' 'we' 'your']
17
[['chait', 'mine', 'sé', 'tá', 'got', 'sí', 'chara', 'length', 'we', 'tirconaill', 'your', 'am', 'legacy', 'faithful', 'oct'], ['faith', 'dhílis', 'oct', 'faithful', 'legacy', 'am', 'your', 'tirconaill', 'we', 'length', 'chara', 'sí', 'got', 'tá', 'sé']]

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py:300: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['oct', 'we'] not in stop_words.
  'stop_words.' % sorted(inconsistent))
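
The warning above can be reproduced without scikit-learn. The following is only a rough imitation of the consistency check, assuming the documented CountVectorizer defaults (lowercasing, and the token pattern `(?u)\b\w\w+\b`, which keeps runs of two or more word characters and splits on apostrophes). The helper names `default_tokens` and `inconsistent_tokens` are hypothetical and not part of scikit-learn.

```python
import re

# Assumption: mirrors CountVectorizer's default token_pattern.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def default_tokens(text):
    """Lowercase the text, then keep runs of two or more word characters."""
    return TOKEN_PATTERN.findall(text.lower())


def inconsistent_tokens(stop_words):
    """Return tokens produced from the stop words that are not themselves in the list."""
    stop_set = set(stop_words)
    found = set()
    for word in stop_words:
        for tok in default_tokens(word):
            if tok not in stop_set:
                found.add(tok)
    return sorted(found)


my_stopwords = ["'tis", "tis", "a", "amp", "30th", "Oct", "22", "ye", "I", "you",
                "20th", "me", "get", "st", "sráid", "we'll", "ll", "le", "tú",
                "bhí", "faithfully", "tír"]

print(default_tokens("we'll"))            # ['we', 'll']
print(default_tokens("'tis"))             # ['tis']
print(inconsistent_tokens(my_stopwords))  # ['oct', 'we']
```

Under these assumptions, "Oct" never matches because the document text is lowercased to "oct" before comparison, and "we'll" never matches because it is split into "we" and "ll" before the stop word filter runs.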

Issue Analytics

  • State: open
  • Created 3 years ago
  • Comments: 8 (4 by maintainers)

Top GitHub Comments

2 reactions
amueller commented, May 21, 2020

I think the main issue you’re running up against is that stopwords are matched after normalization, so they need to be lowercase. That probably needs to be pointed out more explicitly in the docs, since lowercase=True by default.
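
To illustrate the point about normalization, here is a minimal sketch of one possible workaround: pass each stop word through the same preprocessor and tokenizer the vectorizer applies to documents, and use the resulting tokens as the stop word list. `build_preprocessor()` and `build_tokenizer()` are existing CountVectorizer methods; the helper `normalize_stop_words` is a hypothetical name introduced here, and the shortened `docs` string is for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer


def normalize_stop_words(stop_words, **vectorizer_kwargs):
    """Hypothetical helper: tokenize stop words the same way documents are tokenized."""
    vec = CountVectorizer(**vectorizer_kwargs)
    preprocess = vec.build_preprocessor()  # lowercases by default
    tokenize = vec.build_tokenizer()       # applies the token_pattern
    normalized = set()
    for word in stop_words:
        normalized.update(tokenize(preprocess(word)))
    return sorted(normalized)


my_stopwords = ["'tis", "Oct", "we'll", "I", "ye"]
stop = normalize_stop_words(my_stopwords)
print(stop)  # ['ll', 'oct', 'tis', 'we', 'ye']

docs = ["Tirconaill Oct 30th we'll I'll 'tis ye faith"]
vectorizer = CountVectorizer(stop_words=stop)
vectorizer.fit(docs)
print(sorted(vectorizer.vocabulary_))  # 'oct' and 'we' are no longer present
```

If a custom `token_pattern`, `lowercase=False`, or `strip_accents` setting is used, the same keyword arguments would need to be passed to both the helper and the final vectorizer so the two stay consistent.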

1 reaction
rth commented, May 21, 2020

There are generally two issues here,

