TF*IDF yields different results than TfidfTransformer
Describe the bug
I was trying out TF-IDF for NLP with sklearn. To double-check the results, I expected tfidf_matrix == TF*IDF to hold, where TF is the output of the CountVectorizer and IDF is tfidf_transformer.idf_.
I was surprised to see that the two arrays are not the same (the comparison returns False).
Is this a bug, or am I missing something silly?
Steps/Code to Reproduce
Expected Results
Both arrays to be the same
Actual Results
Results differ
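The reproduction code from the issue is not shown here, so the following is only a sketch of one plausible cause, assuming TfidfTransformer was used with its default settings. By default the transformer applies L2 normalization to each row (norm="l2") after multiplying the counts by idf_, so the plain product TF*IDF would not match its output. The toy corpus and all variable names below are hypothetical and are not taken from the issue.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpus; the data used in the issue is not shown.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

# Raw term counts (sparse matrix, one row per document).
tf = CountVectorizer().fit_transform(docs)

# Default settings: norm="l2", so every row is rescaled to unit length.
default_tfidf = TfidfTransformer().fit(tf)
normalized = default_tfidf.transform(tf).toarray()

# Manual product of the counts and the fitted IDF vector.
manual = tf.toarray() * default_tfidf.idf_

# With normalization disabled, the output is the plain product.
raw_tfidf = TfidfTransformer(norm=None).fit(tf)
unnormalized = raw_tfidf.transform(tf).toarray()

# Compare with a tolerance rather than ==, since these are floats.
matches_default = np.allclose(manual, normalized)
matches_unnormalized = np.allclose(manual, unnormalized)
row_norms = np.linalg.norm(normalized, axis=1)

print("manual == default output:   ", matches_default)
print("manual == norm=None output: ", matches_unnormalized)
print("row norms of default output:", row_norms)
```

If this is what happened in the original code, passing norm=None to the transformer (or L2-normalizing the manual product row by row) should make the two arrays agree. Comparing with np.allclose instead of == also avoids false mismatches caused by floating-point rounding.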
Versions
System:
python: 3.7.13 (default, Apr 24 2022, 01:04:09) [GCC 7.5.0]
executable: /usr/bin/python3
machine: Linux-5.4.188+-x86_64-with-Ubuntu-18.04-bionic
Python dependencies:
pip: 21.1.3
setuptools: 57.4.0
sklearn: 1.0.2
numpy: 1.21.6
scipy: 1.4.1
Cython: 0.29.30
pandas: 1.3.5
matplotlib: 3.2.2
joblib: 1.1.0
threadpoolctl: 3.1.0
Built with OpenMP: True
Issue Analytics
- State:
- Created a year ago
- Comments: 5 (5 by maintainers)
Top GitHub Comments
found it! https://github.com/scikit-learn/scikit-learn/blob/80598905e/sklearn/feature_extraction/text.py#L1469
@jeremiedbb PR submitted. You were right, norm cannot be True (error is indeed raised during normalization). Thanks for your guidance.