CountVectorizer does not lowercase() entries in vocabulary when `lowercase` is set to `True`

Describe the bug

The default value of lowercase in CountVectorizer is True, so all document content is lowercased by default. The entries in a user-supplied vocabulary, however, are not lowercased, so any vocabulary entry that contains uppercase characters never matches the content of the documents. I think CountVectorizer should either

  1. lowercase the vocabulary as well when lowercase is True or
  2. not allow upper case characters in the vocabulary when lowercase is True
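
The two options above can be sketched on the caller's side as a small helper. This is only an illustration under stated assumptions: `prepare_vocabulary` and its `strict` flag are hypothetical names, not part of scikit-learn, and the helper simply runs before the vocabulary is handed to CountVectorizer.

```python
from typing import List, Sequence


def prepare_vocabulary(
    vocabulary: Sequence[str],
    lowercase: bool = True,
    strict: bool = False,
) -> List[str]:
    """Hypothetical helper (not a scikit-learn API).

    Option 1 (strict=False): lowercase every entry when `lowercase` is True.
    Option 2 (strict=True): reject entries that contain uppercase characters.
    """
    if not lowercase:
        # Case-sensitive matching was requested, so leave entries untouched.
        return list(vocabulary)

    if strict:
        offending = [term for term in vocabulary if term != term.lower()]
        if offending:
            raise ValueError(
                f"Uppercase characters in vocabulary while lowercase=True: {offending}"
            )
        return list(vocabulary)

    return [term.lower() for term in vocabulary]
```

One caveat with option 1: entries that differ only by case (for example "Apple" and "apple") collapse into duplicates after lowercasing, so the caller may need to deduplicate them first.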

Steps/Code to Reproduce

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def test_count_vectorizer():
    voc = ["A", "B", "C"]
    documents = ["A B C"]

    count_model = CountVectorizer(
        ngram_range=(1, 1),
        vocabulary=voc,
    )
    x = count_model.fit_transform(documents).toarray()
    assert np.array_equal(x, [[1, 1, 1]])  # fails: x is [[0, 0, 0]]

Expected Results

x is [[1, 1, 1]].

Actual Results

x is [[0, 0, 0]].
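
Until the library handles this case, the mismatch can be avoided on the caller's side. The sketch below shows the mismatch and two possible workarounds. The vocabulary and document are made up for illustration, and they use multi-character words on purpose: the default `token_pattern` only keeps tokens of two or more word characters, so single letters such as "A" would be dropped regardless of case.

```python
import warnings

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative data (an assumption, not taken from the issue).
vocabulary = ["Apple", "Banana", "Cherry"]
documents = ["Apple Banana Cherry"]

# Mismatch: documents are lowercased, the vocabulary is not.
with warnings.catch_warnings():
    # Silence any warning, in case the installed release emits one here.
    warnings.simplefilter("ignore")
    x_mismatch = CountVectorizer(vocabulary=vocabulary).fit_transform(documents).toarray()

# Workaround 1: lowercase the vocabulary to match the lowercased documents.
lowered_vocabulary = [term.lower() for term in vocabulary]
x_lowered = CountVectorizer(vocabulary=lowered_vocabulary).fit_transform(documents).toarray()

# Workaround 2: keep the vocabulary as-is and disable lowercasing.
x_case_sensitive = CountVectorizer(
    vocabulary=vocabulary, lowercase=False
).fit_transform(documents).toarray()

print(x_mismatch)        # [[0 0 0]]
print(x_lowered)         # [[1 1 1]]
print(x_case_sensitive)  # [[1 1 1]]
```

Workaround 2 makes matching case-sensitive for the documents as well, so "apple" in a document would no longer count towards "Apple".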

Versions

   setuptools: 51.0.0
      sklearn: 0.23.2
        numpy: 1.19.4
        scipy: 1.5.4
       Cython: 0.29.21
       pandas: 1.1.5
   matplotlib: 3.3.3
       joblib: 1.0.0
threadpoolctl: 2.1.0
Built with OpenMP: True

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
zitorelova commented, Feb 3, 2021

take

0 reactions
philipp-eisen commented, Feb 3, 2021

Yeah that’s what I meant with the message before. Sorry for not being very clear.

Top Results From Across the Web

CountVectorizer converts words to lower case - Stack Overflow
CountVectorizer has a parameter lowercase that defaults to True. In order to disable this behavior, you need to set lowercase=False as ...

sklearn.feature_extraction.text.CountVectorizer
Convert all characters to lowercase before tokenizing. ... If None, no stop words will be used. max_df can be set to a value...

Basics of CountVectorizer | by Pratyaksh Jain
Convert all characters to lowercase before tokenizing. Default is set to true and takes boolean value. ... Stopwords are the words in any...

10+ Examples for Using CountVectorizer - Kavita Ganesan, PhD
In this article, we are going to go in-depth into the different ways you can use CountVectorizer such that you are not just...

Spam Filtering Using Bag-of-Words | by Aditi Mukerjee - Medium
Converting all the words in lower case; Removing punctuation; Remove stop words. ... Note we are not fitting the testing data into the...
