
CountVectorizer does not lowercase() entries in vocabulary when `lowercase` is set to `True`


Describe the bug

The default value of `lowercase` in CountVectorizer is `True`, so the content of all documents is lowercased by default. However, the entries in a user-supplied vocabulary are not lowercased. As a result, any vocabulary entry that contains uppercase characters will never match the lowercased document content. I think CountVectorizer should either

lowercase the vocabulary as well when `lowercase` is `True`, or
not allow uppercase characters in the vocabulary when `lowercase` is `True`
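
The two options proposed above could look roughly like the sketch below. The helper names `lowercase_vocabulary` and `check_vocabulary_case` are hypothetical and are not part of scikit-learn; they only illustrate what each option would do to a vocabulary list before it is handed to the vectorizer.

```python
from typing import Iterable, List


def lowercase_vocabulary(vocabulary: Iterable[str]) -> List[str]:
    """Option 1 (hypothetical helper): lowercase every entry, keeping first-seen order.

    Entries that collide after lowercasing (e.g. "A" and "a") are merged,
    because CountVectorizer rejects vocabularies with duplicate terms.
    """
    seen = set()
    lowered = []
    for term in vocabulary:
        key = term.lower()
        if key not in seen:
            seen.add(key)
            lowered.append(key)
    return lowered


def check_vocabulary_case(vocabulary: Iterable[str]) -> None:
    """Option 2 (hypothetical helper): reject entries containing uppercase characters."""
    offending = [term for term in vocabulary if term != term.lower()]
    if offending:
        raise ValueError(
            f"vocabulary contains uppercase characters while lowercase=True: {offending}"
        )
```

Note that merging colliding entries in option 1 changes the column indices of the resulting matrix, which is one reason a library might prefer option 2 or a warning over silently rewriting the vocabulary.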

Steps/Code to Reproduce

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def test_count_vectorizer():
    voc = ["A", "B", "C"]
    documents = ["A B C"]

    count_model = CountVectorizer(
        ngram_range=(1, 1),
        vocabulary=voc,
    )
    x = count_model.fit_transform(documents).toarray()
    assert np.array_equal(x, [[1, 1, 1]])  # x is [[0, 0, 0]]; should be [[1, 1, 1]]

Expected Results

x is [[1, 1, 1]]

Actual Results

x is [[0, 0, 0]]
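
Until the library handles this case, two user-side workarounds are to lowercase the vocabulary before passing it in, or to disable lowercasing with `lowercase=False`. The sketch below is an illustration rather than the maintainers' fix. It assumes multi-character tokens ("Apple", "Banana", "Cherry") instead of the single letters in the reproduction above, because the default `token_pattern` only keeps tokens of two or more word characters, so single-letter entries such as "A" would be dropped regardless of case.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative data (an assumption, not taken from the issue).
voc = ["Apple", "Banana", "Cherry"]
documents = ["Apple Banana Cherry"]

# As reported: mixed-case vocabulary with the default lowercase=True.
# Documents are lowercased, vocabulary entries are not, so nothing matches.
reported = CountVectorizer(vocabulary=voc).fit_transform(documents).toarray()

# Workaround 1: lowercase the vocabulary yourself before passing it in.
lowered_voc = [term.lower() for term in voc]
lowered = CountVectorizer(vocabulary=lowered_voc).fit_transform(documents).toarray()

# Workaround 2: keep the vocabulary as-is and make matching case-sensitive.
case_sensitive = CountVectorizer(
    vocabulary=voc, lowercase=False
).fit_transform(documents).toarray()

print(reported.tolist())        # [[0, 0, 0]]
print(lowered.tolist())         # [[1, 1, 1]]
print(case_sensitive.tolist())  # [[1, 1, 1]]
```

Workaround 1 suits case-insensitive matching; workaround 2 suits vocabularies where case carries meaning, at the cost of treating "Apple" and "apple" in documents as different tokens.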

Versions

   setuptools: 51.0.0
      sklearn: 0.23.2
        numpy: 1.19.4
        scipy: 1.5.4
       Cython: 0.29.21
       pandas: 1.1.5
   matplotlib: 3.3.3
       joblib: 1.0.0
threadpoolctl: 2.1.0
Built with OpenMP: True

Issue Analytics

Created 3 years ago
Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction

zitorelova commented, Feb 3, 2021

take

0 reactions

philipp-eisen commented, Feb 3, 2021

Yeah that’s what I meant with the message before. Sorry for not being very clear.


