Use a stable sorting algorithm when selecting the `max_features` in TfidfVectorizer

Describe the workflow you want to enable

Currently, when selecting max_features in TfidfVectorizer, the algorithm (line 1172) calls numpy.argsort with the default sort kind, quicksort.

Because quicksort is not stable, features with tied frequencies can be selected differently from one dataset to another. For example:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['AAA', 'AAB', 'AAC', 'AAD', 'AAE', 'AAF']
vectorizer = TfidfVectorizer(smooth_idf=False, max_features=5)
vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())   # ['aaa' 'aab' 'aac' 'aad' 'aae']

Since all the features have the same frequency, they are returned in lexicographic order. This does not happen for the following dataset:

corpus = ['AAA', 'AAB', 'AAC', 'AAD', 'AAE', 'AAF', 'AAG', 'ABA', 'ABB', 'ABC', 'ABD', 'ACA', 'ACB', 'ADA', 'AEA',
          'AFA', 'BAA']
vectorizer = TfidfVectorizer(smooth_idf=False, max_features=5)
vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())  # ['aaa' 'aca' 'acb' 'ada' 'aea']

I would expect the same result as in the first example (the first five features in lexicographic order), but that is not what is returned.

Describe your proposed solution

NumPy has supported kind="stable" in numpy.argsort since version 1.15.0, so simply changing the line to:

mask_inds = (-tfs[mask]).argsort(kind="stable")[:limit]

should be enough to solve the problem.
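The effect of the proposed change can be sketched with NumPy alone, assuming the selection step reduces to an argsort over negated term counts followed by a slice. The helper name `top_k_indices` and the all-ones `tfs` array are illustrative assumptions, not scikit-learn code, and the masking step is left out.

```python
import numpy as np


def top_k_indices(tfs, limit, kind=None):
    """Return the indices of the `limit` largest counts, in ascending order.

    Hypothetical helper mirroring the argsort-and-slice step; `kind` is
    passed straight through to numpy.ndarray.argsort.
    """
    order = (-tfs).argsort(kind=kind)[:limit]
    return np.sort(order)


# 17 terms that all occur once, like the second corpus above.
tfs = np.ones(17, dtype=np.int64)

# With a stable sort, tied terms keep their original (lexicographic)
# positions, so the first `limit` vocabulary indices are selected.
stable = top_k_indices(tfs, limit=5, kind="stable")
print(stable)  # [0 1 2 3 4]

# The default kind makes no promise about the order of ties, so this
# result may vary with the array size and the NumPy build.
default = top_k_indices(tfs, limit=5)
print(default)
```

Only the stable result is predictable here; the output of the default call is deliberately not annotated.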

Describe alternatives you’ve considered, if relevant

Alternatively, use kind="mergesort", which is also stable:

mask_inds = (-tfs[mask]).argsort(kind="mergesort")[:limit]
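As a quick check that the two spellings should be interchangeable for this purpose, here is a small sketch on made-up term counts with many ties. NumPy documents both kinds as stable, so they are expected to pick the same indices; the `tfs` values and `limit` are assumptions chosen for illustration.

```python
import numpy as np

# Term counts with many ties, used only for illustration.
tfs = np.array([2, 1, 2, 1, 2, 1, 2, 1, 2, 1] * 3)
limit = 5

# Both kinds preserve the input order of equal elements.
by_stable = (-tfs).argsort(kind="stable")[:limit]
by_mergesort = (-tfs).argsort(kind="mergesort")[:limit]

print(by_stable)     # [0 2 4 6 8]
print(by_mergesort)  # [0 2 4 6 8]
```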

Additional context

sklearn.__version__
'1.0'

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 10 (9 by maintainers)

Top GitHub Comments

thomasjpfan commented, Oct 26, 2021 (1 reaction)

For backward compatibility, I think we still need to add a parameter to TfidfVectorizer to control the sorting behavior.

ogrisel commented, Feb 26, 2022 (0 reactions)

I think I prefer to be explicit: introduce a new option that enables the switch to the new behavior, and emit a future warning otherwise.

Also, once #22617 is merged we will be able to use .argsort(kind="stable").
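One way the opt-in approach discussed in these comments might look is sketched below. This is purely hypothetical: the function `limit_feature_indices` and the `stable_sort` flag are invented for illustration and are not existing or planned scikit-learn API.

```python
import warnings

import numpy as np


def limit_feature_indices(tfs, limit, stable_sort=None):
    """Pick the indices of the `limit` most frequent terms.

    Hypothetical sketch: `stable_sort` is an invented opt-in flag,
    not an existing scikit-learn parameter.
    """
    if stable_sort is None:
        # Keep the old behavior for now, but tell users it may change.
        warnings.warn(
            "Ties are currently broken by an unstable sort; pass "
            "stable_sort=True to opt in to stable tie-breaking.",
            FutureWarning,
        )
        stable_sort = False
    kind = "stable" if stable_sort else "quicksort"
    return (-np.asarray(tfs)).argsort(kind=kind)[:limit]


# Opting in gives a deterministic choice among tied terms.
print(limit_feature_indices(np.ones(17), 5, stable_sort=True))  # [0 1 2 3 4]
```

Leaving the flag unset keeps the existing tie-breaking and only adds the warning, which is the backward-compatible path the comments ask for.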
