
Error in CountVectorizer with lemmatization and stop words priority


Description

I am working on a pipeline that combines preprocessing modules (CountVectorizer, TF-IDF) with a set of algorithms. It works fine with the settings below, but when I add my own lemmatizer, it generates this warning:

UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['le', 'u'] not in stop_words.
  'stop_words.' % sorted(inconsistent))

I took a deeper dive while debugging and found that tokenizer = LemmaTokenizer() runs before the removal of stop words in CountVectorizer(). My source code is as follows.
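The warning comes from a consistency check: CountVectorizer runs the tokenizer over each stop word and warns if any resulting token is not itself in the stop list (lemmatization can turn a stop word into a form the list does not contain; WordNetLemmatizer, for example, produces 'u' from 'us'). A minimal sketch of that check, using a hypothetical toy_lemmatize in place of WordNetLemmatizer:

```python
# Simplified sketch of the stop-word consistency check (assumption: the real
# check lives inside scikit-learn's CountVectorizer; names here are illustrative).
def toy_lemmatize(token):
    # Naive plural stripping, a stand-in for WordNetLemmatizer.
    return token[:-1] if token.endswith("s") and len(token) > 1 else token

stop_words = {"us", "less", "the", "and"}

# Normalise each stop word the same way the tokenizer would, and collect
# any output token that is missing from the stop list itself.
inconsistent = {toy_lemmatize(w) for w in stop_words} - stop_words
print(sorted(inconsistent))  # lemmatized forms not covered by the list
```

Any non-empty result here corresponds to the "generated tokens ... not in stop_words" message above.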

Steps/Code to Reproduce

pipeline = Pipeline(
    [
        ('vectorizer', CountVectorizer(stop_words=stop_words, max_df=max_df, input='content', encoding='utf-8',
                                       decode_error='strict', strip_accents=None, lowercase=True,
                                       preprocessor=None, tokenizer=LemmaTokenizer(), token_pattern='(?u)\\b\\w\\w+\\b',
                                       ngram_range=(1, 1), analyzer='word', min_df=min_df, max_features=None,
                                       vocabulary=None, binary=False)),
        ('tfidf', TfidfTransformer(norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)),  # inverse document frequency
        ('classifier', model)  # model
    ]
)

The lemma tokenizer is as follows:

import string

from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

class LemmaTokenizer(object):
    def __call__(self, text):
        # The stop-word filter applies to the raw token, before lemmatization.
        return [WordNetLemmatizer().lemmatize(t)
                for t in word_tokenize(text)
                if t not in stopwords.words("english")
                and t not in string.punctuation]

There should be a way to specify the order of what runs first. Removing stop words first and then lemmatizing would be the better technique, although I know some people prefer lemmatizing first so that inflected forms of stop words are reduced to their base form.

I have tried to do it in preprocessing, and that works because there are separate functions, but is there any way I can have it within the pipeline? Thanks.
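One way to get that ordering inside the pipeline (a hedged sketch, not a mechanism the library provides for this): fold stop-word removal into the tokenizer itself, before lemmatization, and pass stop_words=None so CountVectorizer does not re-apply a raw stop list to the lemmatized tokens. The toy_lemmatize helper is a hypothetical stand-in for WordNetLemmatizer.

```python
from sklearn.feature_extraction.text import CountVectorizer

STOP_WORDS = {"the", "and", "us", "a"}

def toy_lemmatize(token):
    # Naive plural stripping, a stand-in for a real lemmatizer.
    return token[:-1] if token.endswith("s") and len(token) > 2 else token

def stop_then_lemma_tokenizer(text):
    tokens = text.lower().split()
    kept = [t for t in tokens if t not in STOP_WORDS]  # 1) drop stop words
    return [toy_lemmatize(t) for t in kept]            # 2) then lemmatize

vectorizer = CountVectorizer(tokenizer=stop_then_lemma_tokenizer,
                             stop_words=None)  # nothing left to double-filter
X = vectorizer.fit_transform(["the cats and dogs chase us"])
print(sorted(vectorizer.vocabulary_))
```

Because the vectorizer never sees a stop_words argument, the consistency warning cannot fire, and the tokenizer fully controls the order of the two steps.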

Versions

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Reactions: 2
  • Comments: 11 (6 by maintainers)

Top GitHub Comments

3 reactions
jnothman commented, Aug 7, 2019

Unless you can get the message a whole lot clearer, it might be better to leave it as it is so that people can find relevant answers here and on stack overflow.

I don’t think the message says anything about generating a stop word list, but yes the message might be a little too succinct. And a UserWarning, to me, means the user did something problematic. I don’t know how you infer from that that everything will be alright: were we more confident that the user did this unintentionally and with bad consequence, we would raise an error.

I am very glad this warning is having its intended effect: informing the user that there is something wrong with their stop list and provoking some thought.

2 reactions
jnothman commented, Aug 6, 2019

The point here is that your stop word list needs to be normalised (lemmatised, etc.) too if you want the entries in that list to be stopped from your normalised tokens. As far as I can tell the warning is triggered correctly. But maybe the warning isn't clear enough about how to fix the problem. Improvements to the message are welcome, but answers on the Stack Overflow questions will also help.
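The fix suggested here can be sketched as follows: run the stop list through the same normalisation as the tokens, and hand CountVectorizer the union of raw and lemmatized forms (toy_lemmatize is again a hypothetical stand-in for WordNetLemmatizer):

```python
def toy_lemmatize(token):
    # Naive plural stripping, a stand-in for a real lemmatizer.
    return token[:-1] if token.endswith("s") and len(token) > 1 else token

raw_stop_words = ["us", "less", "the", "and"]

# Union of raw and normalised forms: every token the tokenizer can emit
# for a stop word is itself in the list, so the warning no longer fires.
normalised_stop_words = sorted(set(raw_stop_words) |
                               {toy_lemmatize(w) for w in raw_stop_words})
print(normalised_stop_words)
```

The resulting list can then be passed as the stop_words argument to CountVectorizer in place of the raw list.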


