CountVectorizer token_pattern issue with a multi-alternative regex pattern.
Description
When using a custom token_pattern that contains a capturing group, CountVectorizer returns no feature names. Am I missing something, or is this a bug?
Steps/Code to Reproduce
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Custom-designed regex pattern, which matches the tokens as per our needs:
# lowercase words of 2+ letters, or 1-3 digit numbers with an optional
# ordinal suffix (st/nd/rd/th).
vectorizer = CountVectorizer(token_pattern=r"[a-z]{2,}|([0-9]{1,3})(?:st|nd|rd|th)?")
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())
Expected Results
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
Actual Results
['']
Alternate code, not using scikit-learn
I tried the same pattern with a simple regex tokenizer and it works smoothly.
import re

def regex_tokenizer(corpus):
    scanner = re.compile(r"[a-z]{2,}|([0-9]{1,3})(?:st|nd|rd|th)?")
    return [match.group() for match in scanner.finditer(corpus)]

corpus = 'This is the first document.'
print(regex_tokenizer(corpus))
Results as expected (note 'his' rather than 'this': this standalone tokenizer does not lowercase the input first, so only the lowercase tail of 'This' matches):
['his', 'is', 'the', 'first', 'document']
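The difference between the two results comes down to which regex API consumes the pattern. A minimal sketch, using only the standard-library re module, of how re.finditer and re.findall treat the same pattern with a capturing group:

```python
import re

# Pattern with one capturing group around the digits
pattern = r"[a-z]{2,}|([0-9]{1,3})(?:st|nd|rd|th)?"
text = "this is the first document"

# finditer yields match objects; match.group() is the full matched token
tokens = [m.group() for m in re.finditer(pattern, text)]
print(tokens)  # ['this', 'is', 'the', 'first', 'document']

# findall, by contrast, returns the contents of the capturing group
# instead of the full match; for tokens matched by the [a-z]{2,}
# alternative the group did not participate, so every result is ''
print(re.findall(pattern, text))  # ['', '', '', '', '']
```

This is why the standalone tokenizer (built on finditer) works while CountVectorizer, which tokenizes via findall internally, produces only empty strings.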
Versions
For scikit-learn >= 0.20.1:
Issue Analytics
- Created 5 years ago
- Comments:23 (12 by maintainers)
Top GitHub Comments
The number of capturing groups can be counted with the .groups attribute of a compiled pattern. I think we should:

It seems we have a bug (or feature) through the use of re.findall: the contents of the capturing group ([0-9]{1,3}) is being returned as the token. Making it non-capturing, (?:[0-9]{1,3}), is one solution. I think this is probably historically a bug, but whether users have exploited it is hard to tell, so we may not be able to fix it easily in a backwards-compatible way.