TF*IDF yields different results than TfidfTransformer

Describe the bug

I was trying out TF-IDF for NLP with sklearn. To double-check the results, I expected tfidf_matrix == TF*IDF to hold, where TF is the output of CountVectorizer and IDF is tfidf_transformer.idf_. I was surprised to see that the two arrays are not the same (the comparison returns False).

[Two screenshots from the original issue are not preserved here.]

Is this a bug, or am I missing something silly?

Steps/Code to Reproduce
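
The reporter's snippet was posted as screenshots and is not available here. The following is only a minimal sketch of the comparison described above, not the original code: the toy corpus and the variable names tf, manual, and same are assumptions made for illustration, and TfidfTransformer is used with its default settings.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpus; not the data from the original report.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

# TF: raw term counts, shape (n_documents, n_terms).
tf = CountVectorizer().fit_transform(corpus)

# Default settings: norm="l2", use_idf=True, smooth_idf=True.
tfidf_transformer = TfidfTransformer()
tfidf_matrix = tfidf_transformer.fit_transform(tf).toarray()

# Manual product of the counts and the learned IDF weights.
manual = tf.toarray() * tfidf_transformer.idf_

# Compare with a floating-point tolerance rather than exact equality.
same = np.allclose(tfidf_matrix, manual)
print(same)
```

Using np.allclose instead of == avoids false mismatches caused purely by floating-point rounding, so any remaining difference reflects the computation itself.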

Expected Results

Both arrays to be the same

Actual Results

Results differ

Versions

System:
    python: 3.7.13 (default, Apr 24 2022, 01:04:09)  [GCC 7.5.0]
executable: /usr/bin/python3
   machine: Linux-5.4.188+-x86_64-with-Ubuntu-18.04-bionic

Python dependencies:
          pip: 21.1.3
   setuptools: 57.4.0
      sklearn: 1.0.2
        numpy: 1.21.6
        scipy: 1.4.1
       Cython: 0.29.30
       pandas: 1.3.5
   matplotlib: 3.2.2
       joblib: 1.1.0
threadpoolctl: 3.1.0

Built with OpenMP: True

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

gloriamacia commented, Jun 11, 2022

@jeremiedbb PR submitted. You were right, norm cannot be True (error is indeed raised during normalization). Thanks for your guidance.
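
As background to the normalization step mentioned in this comment: TfidfTransformer applies L2 row normalization by default (norm="l2"), so its output is not expected to equal a plain product of counts and idf_. The sketch below, which assumes a hypothetical toy corpus and illustrative variable names, checks two things: that norm=None makes the product match, and that dividing each row of the product by its L2 norm reproduces the default output.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpus, used only for illustration.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]
tf = CountVectorizer().fit_transform(corpus)

# With norm=None the transformer only scales each count by its IDF weight.
unnormalized = TfidfTransformer(norm=None).fit(tf)
raw = unnormalized.transform(tf).toarray()
manual = tf.toarray() * unnormalized.idf_
matches_without_norm = np.allclose(raw, manual)

# With the default norm="l2", each row is additionally scaled to unit length.
default = TfidfTransformer().fit_transform(tf).toarray()
row_norms = np.linalg.norm(manual, axis=1, keepdims=True)
matches_after_l2 = np.allclose(default, manual / row_norms)

print(matches_without_norm, matches_after_l2)
```

The manual normalization assumes every document contains at least one vocabulary term; an all-zero row would have a zero norm and would need separate handling.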
