CountVectorizer does not lowercase() entries in vocabulary when `lowercase` is set to `True`
Describe the bug
The default value of `lowercase` in CountVectorizer is `True`. As a result, all document content is lowercased by default. However, the entries in the vocabulary are not lowercased, so if the vocabulary contains uppercase characters it won’t match against the content in the documents.
I think CountVectorizer should either
- lowercase the vocabulary as well when `lowercase` is `True`, or
- not allow uppercase characters in the vocabulary when `lowercase` is `True`.
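Until one of these options is adopted, the mismatch can be avoided on the caller's side. The following is a minimal sketch of two workarounds, not a proposed patch to scikit-learn; the documents and vocabulary terms are hypothetical and were chosen to be longer than one character so that the default `token_pattern` keeps them.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical example data (not taken from the report).
documents = ["Apple Banana Cherry"]
vocabulary = ["Apple", "Banana", "Cherry"]

# Workaround 1: lowercase the vocabulary before passing it in.
# dict.fromkeys drops duplicates created by lowercasing while keeping order,
# since CountVectorizer rejects a vocabulary with duplicate terms.
lowered_vocabulary = list(dict.fromkeys(term.lower() for term in vocabulary))
counts_lowered = (
    CountVectorizer(vocabulary=lowered_vocabulary)
    .fit_transform(documents)
    .toarray()
)

# Workaround 2: keep the cased vocabulary and disable document lowercasing.
counts_cased = (
    CountVectorizer(vocabulary=vocabulary, lowercase=False)
    .fit_transform(documents)
    .toarray()
)

print(counts_lowered, counts_cased)
```

The first variant makes matching case-insensitive, while the second keeps it case-sensitive, so which one fits depends on whether "Apple" and "apple" should count as the same term.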
Steps/Code to Reproduce
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def test_count_vectorizer():
    voc = ["A", "B", "C"]
    documents = ["A B C"]
    count_model = CountVectorizer(
        ngram_range=(1, 1),
        vocabulary=voc,
    )
    x = count_model.fit_transform(documents).toarray()
    assert np.array_equal(x, [[1, 1, 1]])  # x is [[0, 0, 0]]; should be [[1, 1, 1]]

Expected Results
`x` is `[[1, 1, 1]]`.
Actual Results
`x` is `[[0, 0, 0]]`.
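One caveat when reproducing this: the default `token_pattern` of CountVectorizer only keeps tokens of two or more word characters, so single-letter terms such as "A" yield zero counts regardless of casing. The sketch below, which uses hypothetical multi-character terms, separates the two effects. It assumes only the documented `vocabulary`, `lowercase`, and `token_pattern` parameters.

```python
import warnings

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical multi-character terms that the default token_pattern keeps.
documents = ["Apple Banana Cherry"]
vocabulary = ["Apple", "Banana", "Cherry"]

with warnings.catch_warnings():
    # Some scikit-learn releases warn about uppercase vocabulary entries here.
    warnings.simplefilter("ignore")
    # Documents are lowercased, the vocabulary is not, so nothing matches.
    mismatch = CountVectorizer(vocabulary=vocabulary).fit_transform(documents).toarray()

# Single-letter terms with lowercase=False: the default token_pattern
# produces no tokens, so the counts are zero for an unrelated reason.
single_default = (
    CountVectorizer(vocabulary=["A", "B", "C"], lowercase=False)
    .fit_transform(["A B C"])
    .toarray()
)

# Accepting one-character tokens isolates the casing behaviour.
single_wide = (
    CountVectorizer(
        vocabulary=["A", "B", "C"],
        lowercase=False,
        token_pattern=r"(?u)\b\w+\b",
    )
    .fit_transform(["A B C"])
    .toarray()
)

print(mismatch, single_default, single_wide)
```

Here `mismatch` should be all zeros because of the casing issue described in this report, `single_default` should be all zeros because of the token pattern alone, and `single_wide` should count each letter once.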
Versions
setuptools: 51.0.0
sklearn: 0.23.2
numpy: 1.19.4
scipy: 1.5.4
Cython: 0.29.21
pandas: 1.1.5
matplotlib: 3.3.3
joblib: 1.0.0
threadpoolctl: 2.1.0
Built with OpenMP: True
Issue Analytics
- State:
- Created 3 years ago
- Comments: 5 (3 by maintainers)
Top GitHub Comments
take
Yeah that’s what I meant with the message before. Sorry for not being very clear.