TF*IDF yields different results than TfidfTransformer

Describe the bug

I was trying out TF-IDF for NLP with sklearn. To double-check the results, I expected tfidf_matrix == TF*IDF to hold, where TF is the output of CountVectorizer and IDF is tfidf_transformer.idf_. I was surprised to see that the two arrays are not the same (the comparison returns False).

Is this a bug, or am I missing something silly?

Steps/Code to Reproduce
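
The original snippet is not preserved on this page. The following is only a minimal sketch of the comparison described above, not the reporter's code: the toy corpus and the variable names `tf`, `manual` and `same` are assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpus; the corpus used in the original report is unknown.
corpus = [
    "the cat sat on the mat",
    "the dog sat",
    "the dog barked at the cat",
]

# TF: raw term counts per document.
tf = CountVectorizer().fit_transform(corpus)

# TF-IDF as computed by TfidfTransformer with its default settings.
tfidf_transformer = TfidfTransformer()
tfidf_matrix = tfidf_transformer.fit_transform(tf)

# Manual product of the counts and the fitted IDF vector.
manual = tf.toarray() * tfidf_transformer.idf_

# Compare the two results element-wise (with floating-point tolerance).
same = np.allclose(tfidf_matrix.toarray(), manual)
print(same)
```

With default settings the comparison is expected to report a mismatch, which is the behaviour described in this issue.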

Expected Results

Both arrays to be the same

Actual Results

Results differ

Versions

System:
    python: 3.7.13 (default, Apr 24 2022, 01:04:09)  [GCC 7.5.0]
executable: /usr/bin/python3
   machine: Linux-5.4.188+-x86_64-with-Ubuntu-18.04-bionic

Python dependencies:
          pip: 21.1.3
   setuptools: 57.4.0
      sklearn: 1.0.2
        numpy: 1.21.6
        scipy: 1.4.1
       Cython: 0.29.30
       pandas: 1.3.5
   matplotlib: 3.2.2
       joblib: 1.1.0
threadpoolctl: 3.1.0

Built with OpenMP: True

Issue Analytics

State:
Created a year ago
Comments: 5 (5 by maintainers)

Top GitHub Comments

1 reaction

gloriamacia commented, Jun 6, 2022

found it! https://github.com/scikit-learn/scikit-learn/blob/80598905e/sklearn/feature_extraction/text.py#L1469

0 reactions

gloriamacia commented, Jun 11, 2022

@jeremiedbb PR submitted. You were right, norm cannot be True (error is indeed raised during normalization). Thanks for your guidance.
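
The linked source line is not quoted here, so the following is an interpretation rather than a summary of the thread. One explanation consistent with the comments about `norm` is that TfidfTransformer applies L2 row normalization by default (`norm='l2'`), so its output matches the plain TF*IDF product only when `norm=None`. A minimal sketch under that assumption, using a hypothetical toy corpus and illustrative variable names:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.preprocessing import normalize

# Hypothetical toy corpus, used only for illustration.
corpus = [
    "the cat sat on the mat",
    "the dog sat",
    "the dog barked at the cat",
]
tf = CountVectorizer().fit_transform(corpus)

# With normalization disabled, the output is the plain TF * IDF product.
raw_transformer = TfidfTransformer(norm=None)
raw_tfidf = raw_transformer.fit_transform(tf).toarray()
manual = tf.toarray() * raw_transformer.idf_
matches_without_norm = np.allclose(raw_tfidf, manual)

# With the default norm='l2', each row is additionally scaled to unit length,
# so the manual product has to be L2-normalized before comparing.
default_tfidf = TfidfTransformer().fit_transform(tf).toarray()
matches_after_l2 = np.allclose(default_tfidf, normalize(manual, norm="l2"))

print(matches_without_norm, matches_after_l2)
```

Either passing `norm=None` to the transformer or normalizing the manual product should make the two arrays agree.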
