More detailed instructions needed for making (non-English) stop word lists compatible
This issue relates to #10735 re: the improvement of stop word lists.
As stated in the previous discussion, more detailed documentation on making custom stop-word lists compatible would be more helpful than an updated built-in stop-word list, since use cases even within English can be very specific.
One of my corpora is a collection of Hiberno-English letters that uses many outdated word forms as well as words derived from Irish Gaelic. The .txt files also contain occasional UTF-8 errors and orphaned XML tags inherited from the original documents.
While I acknowledge that removing "lb" for unresolved line-break tags may best be done in pre-processing, I am also having issues with word forms such as "we'll", "won't" or "'tis", which use apostrophes. In addition, abbreviated words such as "oct" for "October" seem to be problematic.
I have changed my stop-word list multiple times to include word forms both with and without apostrophes, but although both "October" and "Oct" are on my list, the token "oct" is still not being removed.
Here is a sample script that might help to update the documentation:
```python
import numpy as np
from sklearn import decomposition
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Tirconaill tír st sráid Oct 30th 22 A Chait A chara Dhílis 20th I you your me mine faith faithful faithfully ye get got we'll I'll 'tis le length legacy tá tú bhí sé sí & amp am"]

my_stopwords = ["'tis", "tis", "a", "amp", "30th", "Oct", "22", "ye", "I", "you",
                "20th", "me", "get", "st", "sráid", "we'll", "ll", "le", "tú",
                "bhí", "faithfully", "tír"]

# Quick check of the document-term matrix with the custom stop-word list
print(CountVectorizer(stop_words=my_stopwords).fit_transform(docs).A)

# Build the document-term matrix and inspect the resulting vocabulary
vectorizer = CountVectorizer(stop_words=my_stopwords)
dtm = vectorizer.fit_transform(docs).toarray()
vocab = np.array(vectorizer.get_feature_names())  # on scikit-learn >= 1.0 use get_feature_names_out()
print(dtm.shape)
print(vocab)
print(len(vocab))

# Simple NMF topic model on the document-term matrix
num_topics = 2
num_top_words = 15
clf = decomposition.NMF(n_components=num_topics, random_state=1)
doctopic = clf.fit_transform(dtm)

# Collect the top words per topic
topic_words = []
for topic in clf.components_:
    word_idx = np.argsort(topic)[::-1][0:num_top_words]
    topic_words.append([vocab[i] for i in word_idx])
print(topic_words)
```
The output I got was this:
```
[[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]]
(1, 17)
['am' 'chait' 'chara' 'dhílis' 'faith' 'faithful' 'got' 'legacy' 'length'
 'mine' 'oct' 'sé' 'sí' 'tirconaill' 'tá' 'we' 'your']
17
[['chait', 'mine', 'sé', 'tá', 'got', 'sí', 'chara', 'length', 'we', 'tirconaill', 'your', 'am', 'legacy', 'faithful', 'oct'],
 ['faith', 'dhílis', 'oct', 'faithful', 'legacy', 'am', 'your', 'tirconaill', 'we', 'length', 'chara', 'sí', 'got', 'tá', 'sé']]

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py:300: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['oct', 'we'] not in stop_words.
  'stop_words.' % sorted(inconsistent))
```
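For reference, the consistency check behind that warning can be reproduced by hand: run the vectorizer's analyzer over each stop word and compare the resulting tokens with the list itself. A minimal sketch, assuming the same `my_stopwords` list as in the script above:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Run the default analyzer (lowercasing + token_pattern) over each stop word
# and report any resulting tokens that are not themselves on the list.
analyzer = CountVectorizer().build_analyzer()
for sw in my_stopwords:
    tokens = analyzer(sw)
    missing = [t for t in tokens if t not in my_stopwords]
    if missing:
        print(sw, "->", tokens, "missing:", missing)

# "Oct" is analyzed to ["oct"] and "we'll" to ["we", "ll"], so the lowercase
# "oct" and the token "we" would need to be on the stop-word list themselves.
```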
Top GitHub Comments
I think the main issue you're running up against is that stop words are matched after normalization, so they need to be lowercase. That probably needs to be pointed out more explicitly in the docs, since `lowercase=True` by default.

There are generally two issues here: the lowercasing just mentioned, and the tokenization of contractions such as `won't`. The regexp from https://github.com/scikit-learn/scikit-learn/pull/7008 would indeed work, however it has wider concerns, see https://github.com/scikit-learn/scikit-learn/issues/6892#issuecomment-233162541. For any advanced NLP it's likely better to use a specialized tokenizer package. You can use one with scikit-learn by passing the `tokenizer` parameter to `CountVectorizer`. In particular, "won't" is classically tokenized as `['wo', "n't"]`, I think. One could do that with a regex, but with just one regex the limit on handling all special cases is reached pretty fast, so using a specialized tokenizer package is probably better.
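To make that concrete, here is a minimal sketch of that approach; NLTK's `word_tokenize` (which needs the `punkt` data downloaded) is just one example of such a tokenizer, and the shortened stop-word list and toy document are illustrative only:

```python
from nltk.tokenize import word_tokenize  # assumes nltk and its 'punkt' data are installed
from sklearn.feature_extraction.text import CountVectorizer

# With the default lowercase=True, stop words must be lowercase, and they must
# match the tokens the chosen tokenizer actually produces: NLTK's Treebank-style
# tokenizer splits "won't" into "wo" + "n't" and "we'll" into "we" + "'ll".
my_stopwords = ["oct", "30th", "22", "ye", "i", "you", "me", "we", "wo", "n't", "'ll"]

vectorizer = CountVectorizer(tokenizer=word_tokenize, stop_words=my_stopwords)

docs = ["Oct 30th we'll won't faith faithful faithfully legacy"]
dtm = vectorizer.fit_transform(docs)
print(sorted(vectorizer.vocabulary_))  # expected: ['faith', 'faithful', 'faithfully', 'legacy']
```

Any callable that takes a string and returns a list of token strings can be passed as `tokenizer`, so the same pattern works with spaCy or a custom regex-based function; the key point is that the stop-word list has to match the tokens that callable actually emits after lowercasing.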