CountVectorizer token_pattern issue with a multi-alternative regex pattern.
Description
When using a custom token_pattern that contains a capturing group, CountVectorizer returns no feature names. Am I missing something, or is this a bug?
Steps/Code to Reproduce
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Custom-designed regex pattern, which matches the tokens as per our needs:
# lowercase words of 2+ letters, or 1-3 digit numbers with an optional
# ordinal suffix (st/nd/rd/th).
vectorizer = CountVectorizer(token_pattern=r"[a-z]{2,}|([0-9]{1,3})(?:st|nd|rd|th)?")
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())
Expected Results
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
Actual Results
['']
Alternate code, not using scikit-learn
I tried the same pattern with a simple regex tokenizer and it works smoothly.
import re

def regex_tokenizer(corpus):
    scanner = re.compile(r"[a-z]{2,}|([0-9]{1,3})(?:st|nd|rd|th)?")
    return [match.group() for match in scanner.finditer(corpus)]

corpus = 'This is the first document.'
print(regex_tokenizer(corpus))
Results as expected (note 'his' rather than 'this': this standalone tokenizer does not lowercase the input first, so only the lowercase tail of 'This' matches):
['his', 'is', 'the', 'first', 'document']
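The difference between the two results comes down to which regex API consumes the pattern. A minimal sketch, using only the standard-library re module, of how re.finditer and re.findall treat the same pattern with a capturing group:

```python
import re

# Pattern with one capturing group around the digits
pattern = r"[a-z]{2,}|([0-9]{1,3})(?:st|nd|rd|th)?"
text = "this is the first document"

# finditer yields match objects; match.group() is the full matched token
tokens = [m.group() for m in re.finditer(pattern, text)]
print(tokens)  # ['this', 'is', 'the', 'first', 'document']

# findall, by contrast, returns the contents of the capturing group
# instead of the full match; for tokens matched by the [a-z]{2,}
# alternative the group did not participate, so every result is ''
print(re.findall(pattern, text))  # ['', '', '', '', '']
```

This is why the standalone tokenizer (built on finditer) works while CountVectorizer, which tokenizes via findall internally, produces only empty strings.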
Versions
For scikit-learn >= 0.20.1:
Issue Analytics
- Created 5 years ago
- Comments:23 (12 by maintainers)
Top GitHub Comments
The number of capturing groups can be counted with the .groups attribute of a compiled pattern. I think we should:

It seems we have a bug (or feature) through the use of re.findall: the contents of the capturing group ([0-9]{1,3}) is being returned as the token. Making it non-capturing, (?:[0-9]{1,3}), is one solution. I think this is probably historically a bug, but whether users have exploited it is hard to tell, so we may not be able to fix it easily in a backwards-compatible way.