Error in Count Vectorizer with Lemmatization and Stop words priority.
Description
I am working with a pipeline that combines preprocessing modules (CountVectorizer, TF-IDF) with a set of algorithms. It works fine with the following settings, but when I add my own lemmatizer, it generates this error:
UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['le', 'u'] not in stop_words.
'stop_words.' % sorted(inconsistent))
I took a deeper dive and, while debugging, found that the tokenizer = LemmaTokenizer() runs before stop-word removal in CountVectorizer(). My source code is as follows.
Steps/Code to Reproduce
```python
pipeline = Pipeline(
    [
        ('vectorizer', CountVectorizer(stop_words=stop_words, max_df=max_df, input='content',
                                       encoding='utf-8', decode_error='strict', strip_accents=None,
                                       lowercase=True, preprocessor=None, tokenizer=LemmaTokenizer(),
                                       token_pattern='(?u)\\b\\w\\w+\\b', ngram_range=(1, 1),
                                       analyzer='word', min_df=min_df, max_features=None,
                                       vocabulary=None, binary=False)),
        ('tfidf', TfidfTransformer(norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)),  # inverse document frequency
        ('classifier', model)  # model
    ]
)
```
The lemma tokenizer is as follows:
```python
import string

from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

class LemmaTokenizer(object):
    def __call__(self, text):
        return [WordNetLemmatizer().lemmatize(t)
                for t in word_tokenize(text)
                if t not in stopwords.words("english") and
                t not in string.punctuation]
```
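For context, the warning fires because scikit-learn runs each entry of `stop_words` through the supplied tokenizer and reports any resulting token that is not itself in the stop list (WordNetLemmatizer maps "less" to "le" and "us" to "u"). A minimal sketch of that consistency check, using a hypothetical dict-based lemma map in place of WordNetLemmatizer so it runs without NLTK data:

```python
# Toy stand-in for WordNetLemmatizer: only the two mappings behind the warning.
LEMMA_MAP = {"less": "le", "us": "u"}  # hypothetical, for illustration

def lemma_tokenize(text):
    return [LEMMA_MAP.get(t, t) for t in text.split()]

stop_words = {"the", "less", "us"}

# This mirrors scikit-learn's check: tokenize every stop word and collect
# any output token that is missing from the stop list.
inconsistent = {tok for w in stop_words for tok in lemma_tokenize(w)} - stop_words
print(sorted(inconsistent))  # -> ['le', 'u'], the tokens named in the warning
```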
There should be a way to specify the order of what runs first. Removing stop words first and then lemmatizing would be a better technique, although I know some people prefer lemmatizing first so that other forms of stop words can be reduced to their base form.
I have tried doing it in preprocessing, and that works because the functions are separate, but is there any way I can have it within the pipeline? Thanks.
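One way to control the order inside the pipeline is to pass a callable as `analyzer=`, since CountVectorizer then delegates all token handling (including stop-word removal) to that callable. A sketch, with a hypothetical toy lemmatizer standing in for WordNetLemmatizer:

```python
def toy_lemmatize(token):
    # Hypothetical stand-in for WordNetLemmatizer().lemmatize
    return {"cats": "cat", "dogs": "dog"}.get(token, token)

def make_analyzer(stop_words, lemmatize):
    """Analyzer that removes stop words *before* lemmatizing.

    Swap the filter and the lemmatize step to get the other order.
    Pass the result as CountVectorizer(analyzer=...).
    """
    def analyze(text):
        return [lemmatize(t) for t in text.lower().split() if t not in stop_words]
    return analyze

analyze = make_analyzer({"the", "and"}, toy_lemmatize)
print(analyze("The cats and dogs"))  # -> ['cat', 'dog']
```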
Versions
Issue Analytics
- State:
- Created 5 years ago
- Reactions:2
- Comments:11 (6 by maintainers)
Top GitHub Comments
Unless you can get the message a whole lot clearer, it might be better to leave it as it is so that people can find relevant answers here and on stack overflow.
I don’t think the message says anything about generating a stop word list, but yes the message might be a little too succinct. And a UserWarning, to me, means the user did something problematic. I don’t know how you infer from that that everything will be alright: were we more confident that the user did this unintentionally and with bad consequence, we would raise an error.
I am very glad this warning is having its intended effect: informing the user that there is something wrong with their stop list and provoking some thought.
The point here is that your stop word list needs to be normalised (lemmatised, etc.) too if you want the entries in that list to be stopped from your normalised tokens. As far as I can tell the warning is triggered correctly. But maybe the warning isn’t clear enough about how to fix the problem. Improvements to the message are welcome, but answers on the Stack Overflow questions will also help.
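The fix the maintainer describes can be sketched as follows: run the stop list through the same lemmatizer the tokenizer uses before handing it to CountVectorizer, so every stop word survives tokenization unchanged. A hypothetical dict-based lemmatizer stands in for WordNetLemmatizer here so the example runs without NLTK data:

```python
import warnings

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for WordNetLemmatizer (maps "us"->"u", "less"->"le").
LEMMA_MAP = {"us": "u", "less": "le"}

def toy_lemmatize(token):
    return LEMMA_MAP.get(token, token)

raw_stop_words = ["and", "less", "the", "us"]
# Normalise the stop list with the same function the tokenizer applies.
stop_words = sorted({toy_lemmatize(w) for w in raw_stop_words})

vectorizer = CountVectorizer(
    stop_words=stop_words,
    tokenizer=lambda text: [toy_lemmatize(t) for t in text.split()],
    token_pattern=None,  # tokenizer is set, so token_pattern would be unused
)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    X = vectorizer.fit_transform(["the cats and less dogs like us"])

# The stop-word consistency warning no longer fires.
assert not any("stop_words" in str(w.message) for w in caught)
print(sorted(vectorizer.vocabulary_))  # -> ['cats', 'dogs', 'like']
```

Because the normalised list is consistent under the tokenizer, the `UserWarning` about inconsistent stop words is not emitted, and the lemmatised stop words are still filtered from the lemmatised tokens.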