Use a stable sorting algorithm when selecting the `max_features` in TfidfVectorizer

Describe the workflow you want to enable

Currently, when max_features is set in TfidfVectorizer, the feature-selection step (line 1172) calls numpy.argsort with its default kind, quicksort.

Because quicksort is not stable, ties between features with equal counts are broken inconsistently across datasets. For example:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['AAA', 'AAB', 'AAC', 'AAD', 'AAE', 'AAF']
vectorizer = TfidfVectorizer(smooth_idf=False, max_features=5)
vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())   # ['aaa' 'aab' 'aac' 'aad' 'aae']

All the features have the same frequency, so they are returned in lexicographic order. This does not happen for the following dataset:

corpus = ['AAA', 'AAB', 'AAC', 'AAD', 'AAE', 'AAF', 'AAG', 'ABA', 'ABB', 'ABC', 'ABD', 'ACA', 'ACB', 'ADA', 'AEA',
          'AFA', 'BAA']
vectorizer = TfidfVectorizer(smooth_idf=False, max_features=5)
vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())  # ['aaa' 'aca' 'acb' 'ada' 'aea']

I would expect the result to be the same as in the first example, but it is not.

Describe your proposed solution

NumPy 1.15.0 added the kind="stable" option to numpy.argsort, so simply changing the line to:

mask_inds = (-tfs[mask]).argsort(kind="stable")[:limit]

should be enough to solve the problem.
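To illustrate why the stable sort gives the expected result, here is a minimal sketch of the selection step run on tied counts. The array `tfs` is a hypothetical stand-in for the 17 equal term counts of the second corpus, and `mask` assumes that no feature was pruned beforehand; neither is taken from the scikit-learn source beyond the line quoted above.

```python
import numpy as np

# Hypothetical stand-in: 17 features that all have the same count.
tfs = np.ones(17, dtype=np.int64)
# Assumption: no feature was filtered out before the limit is applied.
mask = np.ones(len(tfs), dtype=bool)
limit = 5  # plays the role of max_features

# A stable sort keeps tied entries in their original order.
mask_inds = (-tfs[mask]).argsort(kind="stable")[:limit]
kept = np.where(mask)[0][mask_inds]
print(kept)  # [0 1 2 3 4]
```

If the feature indices follow the alphabetical order of the vocabulary, as the first example suggests, positions 0 to 4 correspond to the first five terms in lexicographic order.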

Describe alternatives you’ve considered, if relevant

Alternatively, use kind="mergesort", which is also stable:

mask_inds = (-tfs[mask]).argsort(kind="mergesort")[:limit]
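As a quick check that the two options are interchangeable for this purpose, the sketch below compares them on a small array with ties. The counts are made up for illustration. Since a stable ordering of any input is unique, both calls should return the same indices.

```python
import numpy as np

# Made-up term counts containing ties, for illustration only.
tfs = np.array([3, 5, 3, 5, 1, 3])

stable_inds = (-tfs).argsort(kind="stable")
merge_inds = (-tfs).argsort(kind="mergesort")

# Highest counts first; ties keep their original index order.
print(stable_inds)  # [1 3 0 2 5 4]
print(np.array_equal(stable_inds, merge_inds))  # True
```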

Additional context

sklearn.__version__
'1.0'

Top GitHub Comments

thomasjpfan commented, Oct 26, 2021

For backward compatibility, I think we still need to add a parameter to TfidfVectorizer to control the sorting behavior.

ogrisel commented, Feb 26, 2022

I think I prefer to be explicit and introduce a new option to enable the switch to the new behavior explicitly with a future warning otherwise.

Also, once #22617 is merged we will be able to use .argsort(kind="stable").
