Use a stable sorting algorithm when selecting the `max_features` in TfidfVectorizer
Describe the workflow you want to enable
Currently, when selecting max_features in TfidfVectorizer, the algorithm (line 1172) uses numpy.argsort with its default kind, quicksort.
Given that quicksort is not stable, this leads to inconsistent behavior across different datasets, for example:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ['AAA', 'AAB', 'AAC', 'AAD', 'AAE', 'AAF']
vectorizer = TfidfVectorizer(smooth_idf=False, max_features=5)
vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out()) # ['aaa' 'aab' 'aac' 'aad' 'aae']
Since all the features have the same frequency, they are returned in lexicographic order. This does not happen for the following dataset:
corpus = ['AAA', 'AAB', 'AAC', 'AAD', 'AAE', 'AAF', 'AAG', 'ABA', 'ABB', 'ABC', 'ABD', 'ACA', 'ACB', 'ADA', 'AEA',
'AFA', 'BAA']
vectorizer = TfidfVectorizer(smooth_idf=False, max_features=5)
vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out()) # ['aaa' 'aca' 'acb' 'ada' 'aea']
I would expect the result to be the same as in the first example, but it is not.
Describe your proposed solution
Since NumPy 1.15.0, numpy.argsort has accepted the option kind="stable", so simply changing the line to:
mask_inds = (-tfs[mask]).argsort(kind="stable")[:limit]
should be enough to solve the problem.
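A minimal sketch of what the stable sort guarantees, using NumPy only. The arrays here are made up to mirror the second corpus (17 terms that all tie); only the selection line itself comes from the proposal above, and `tfs`, `mask` and `limit` are filled with assumed values.

```python
import numpy as np

# Hypothetical counts: 17 terms, each occurring once, already stored in
# lexicographic order (an assumption mirroring the second corpus above).
tfs = np.ones(17, dtype=np.int64)
mask = np.ones(17, dtype=bool)  # assume nothing was pruned beforehand
limit = 5                       # plays the role of max_features

# A stable sort keeps tied entries in their original relative order,
# so the first `limit` indices are the ones selected.
mask_inds = (-tfs[mask]).argsort(kind="stable")[:limit]
print(mask_inds)  # [0 1 2 3 4]
```

With every count tied, the stable sort leaves the indices untouched, so the five lexicographically smallest terms are kept regardless of how many terms the corpus contains.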
Describe alternatives you’ve considered, if relevant
Alternatively, use the mergesort option, which is also stable:
mask_inds = (-tfs[mask]).argsort(kind="mergesort")[:limit]
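As a quick check that both options order ties the same way, the sketch below compares them on a small array of made-up counts (the values are illustrative, not taken from the issue).

```python
import numpy as np

# Made-up counts containing several ties.
counts = np.array([3, 1, 3, 2, 1, 3, 2])

by_stable = (-counts).argsort(kind="stable")
by_mergesort = (-counts).argsort(kind="mergesort")

# Both kinds are stable: among equal counts, lower indices come first.
print(by_stable)                                # [0 2 5 3 6 1 4]
print(np.array_equal(by_stable, by_mergesort))  # True
```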
Additional context
sklearn.__version__
'1.0'
Issue Analytics
- Created 2 years ago
- Comments:10 (9 by maintainers)
Top GitHub Comments
For backward compatibility, I think we still need to add a parameter to TfidfVectorizer to control the sorting behavior.
I think I prefer to be explicit and introduce a new option to enable the switch to the new behavior explicitly, with a future warning otherwise.
Also, once #22617 is merged we will be able to use .argsort(kind="stable").