Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might look while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

CountVectorizer token_pattern issue with a multi-alternative regex pattern

See original GitHub issue

Description

When using a custom token_pattern, CountVectorizer returns no feature names. Am I missing something?

Steps/Code to Reproduce

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
# Custom-designed regex pattern that matches tokens as per our needs
vectorizer = CountVectorizer(token_pattern=r"[a-z]{2,}|([0-9]{1,3})(?:st|nd|rd|th)?")
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())

Expected Results

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

Actual Results

['']

Alternative code not using scikit-learn

I tried the same pattern with a simple regex tokenizer and it works smoothly.

import re

def regex_tokenizer(text):
    scanner = re.compile(r"[a-z]{2,}|([0-9]{1,3})(?:st|nd|rd|th)?")
    # match.group() returns the full text of each match
    return [match.group() for match in scanner.finditer(text)]


corpus = 'This is the first document.'
print(regex_tokenizer(corpus))

Results as expected

['his', 'is', 'the', 'first', 'document']
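
The difference between the two snippets comes down to re.findall versus re.finditer. As the maintainer comment below confirms, CountVectorizer tokenizes with findall, and when a pattern contains exactly one capturing group, findall returns the group's contents rather than the whole match, so every token matched by the [a-z]{2,} alternative comes back as an empty string. A minimal demonstration with plain re, no scikit-learn involved:

import re

pattern = r"[a-z]{2,}|([0-9]{1,3})(?:st|nd|rd|th)?"
text = "this is the first document"

# finditer yields Match objects; match.group() is always the full match
print([m.group() for m in re.finditer(pattern, text)])
# ['this', 'is', 'the', 'first', 'document']

# findall returns the capturing group's contents when the pattern has
# exactly one group; the group never participates for word matches,
# so every word token collapses to an empty string
print(re.findall(pattern, text))
# ['', '', '', '', '']

That empty string is exactly the single feature name [''] reported above.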

Versions

For scikit-learn >= 0.20.1:

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 23 (12 by maintainers)

Top GitHub Comments

2 reactions
jnothman commented, Jan 14, 2019

The number of capturing groups can be counted with the .groups attribute of a compiled pattern.
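
For illustration, checking the pattern from this issue against scikit-learn's default token_pattern:

import re

# .groups on a compiled pattern counts its capturing groups;
# non-capturing (?:...) groups are not included in the count
print(re.compile(r"[a-z]{2,}|([0-9]{1,3})(?:st|nd|rd|th)?").groups)  # 1
print(re.compile(r"(?u)\b\w\w+\b").groups)                           # 0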

I think we should:

  • change the code to raise an error if the number of groups is greater than 1, since this wouldn’t work at the moment (I think?); a rough sketch of such a check follows this list
  • document the behaviour when there is a capturing group (i.e. the captured group content, not the entire match, becomes the token)
  • possibly deprecate this support for a capturing group
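
A rough sketch of the first proposal, assuming the check runs wherever the vectorizer compiles token_pattern (the helper name here is hypothetical, not scikit-learn API):

import re

def check_token_pattern(token_pattern):
    # Hypothetical validation: with two or more capturing groups,
    # findall returns tuples, which can never be usable string tokens
    compiled = re.compile(token_pattern)
    if compiled.groups > 1:
        raise ValueError(
            "More than 1 capturing group in token pattern; use "
            "non-capturing groups (?:...) for the extra alternatives."
        )
    return compiled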
2 reactions
jnothman commented, Jan 14, 2019

It seems we have a bug (or feature) arising from the use of re.findall. The contents of the capturing group ([0-9]{1,3}) are being returned as the token. One solution is to make it non-capturing: (?:[0-9]{1,3}). I think this is probably historically a bug, but whether users have exploited it is hard to tell, so we may not be able to easily fix it in a backwards-compatible way.
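
Applying that suggestion to the reproduction above: with no capturing group left in the pattern, findall returns whole matches again and the vectorizer behaves as expected. (On recent scikit-learn releases, get_feature_names_out() replaces get_feature_names().)

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
# Same pattern with the digit group made non-capturing
vectorizer = CountVectorizer(
    token_pattern=r"[a-z]{2,}|(?:[0-9]{1,3})(?:st|nd|rd|th)?"
)
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())
# ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']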

Read more comments on GitHub >

Top Results From Across the Web

Only words or numbers re pattern. Tokenize with ...
I'm using python CountVectorizer to ...
Read more >
sklearn CountVectorizer token_pattern -- skip token if ...
My thought was to use CountVectorizer 's token_pattern argument to supply a regex string that will match anything except one or more numbers ......
Read more >
Re-learning regexes to help with NLP – Be Good, Work Hard, Get ...
Is there an alternative approach? ... Or have a look at Scikit-learn's CountVectorizer which uses a very simply regex for the token-pattern that...
Read more >
Practical Text Classification With Python and Keras
The token pattern itself defaults to token_pattern='(?u)\b\w\w+\b' , which is a regex pattern that says, “a word is 2 or more Unicode word...
Read more >
Working with Text data — Applied Machine Learning in Python
This is pretty simple and pretty restrictive. Changing the token pattern regex¶. vect = CountVectorizer(token_pattern ...
Read more >
