
Suggestion: Fill with feature mean instead of 0 for VarianceThreshold.inverse_transform

See original GitHub issue

Today, the sklearn.feature_selection.VarianceThreshold.inverse_transform method fills in zeros for features that were removed for having too-small variance. This is certainly predictable, easy to implement, and easy to explain.

However, filling in zeros without regard to the data passed to fit means that the reconstruction error can become arbitrarily large. For example, suppose that one of the features in your data always takes the value 10**6. It clearly has zero variance, since it always takes the same value; however, filling in zeros for that feature when the data is round-tripped via transform and inverse_transform will produce an output that differs dramatically from the input.
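
To make the failure concrete, here is a quick round-trip with scikit-learn's actual estimator (the toy data is made up for illustration):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy data: the second column is constant at 10**6, so it has zero variance.
X = np.array([[1.0, 1e6],
              [2.0, 1e6],
              [3.0, 1e6]])

vt = VarianceThreshold()          # default threshold=0.0 drops constant columns
Z = vt.fit_transform(X)           # only the first column survives
X_back = vt.inverse_transform(Z)  # the dropped column comes back as zeros

print(np.abs(X - X_back).max())   # 1000000.0: the error equals the constant itself
```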

Instead, I think it would make sense for inverse_transform to fill in the column means, computed and stored from the data passed to fit. This would make the reconstruction via inverse_transform more closely reflect the data passed to fit, because any column removed for having variance less than threshold must, by definition, be tightly grouped about its sample mean.

Naturally, in the special case where the sample means of the removed features are already zero, the proposed inverse_transform behaves exactly as it does today, since it fills in zeros for those features.

In terms of code, this just means keeping an array of column means in addition to the indices of the removed columns.
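
Here is a rough sketch of that idea, assuming a hypothetical subclass (the name MeanFillVarianceThreshold and the column_means_ attribute are invented here, not part of scikit-learn):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

class MeanFillVarianceThreshold(VarianceThreshold):
    """Hypothetical variant whose inverse_transform back-fills removed
    columns with their training means instead of zeros."""

    def fit(self, X, y=None):
        super().fit(X, y)
        # Store per-column means of the (dense) training data for back-filling.
        self.column_means_ = np.asarray(X).mean(axis=0)
        return self

    def inverse_transform(self, X):
        X_back = super().inverse_transform(X)  # zero-filled reconstruction
        mask = self.get_support()              # True for columns that were kept
        X_back[:, ~mask] = self.column_means_[~mask]
        return X_back
```

Whether this would land as a subclass, a parameter on the method, or a parameter on the class (a question raised in the comments below) is separate; the sketch only shows where the stored means would plug in.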

Of course, it’s possible that I have missed an important subtlety, or there is a competing concern which outweighs the argument that I’ve outlined here. If that’s the case, I’d like to know what I’ve missed!

Issue Analytics

  • State: open
  • Created: 4 years ago
  • Comments: 7 (6 by maintainers)

Top GitHub Comments

1 reaction
amueller commented, Nov 2, 2019

So I’m not sure if this change will be beneficial. Why should VarianceThreshold have different behavior from other feature selection methods? You could argue the same for other feature selectors, right?

I’m also not sure I understand the semantics of looking at reconstruction error in feature selection. @Sycor4x can you explain why you’re doing this or what the interpretation of this is?

@jnothman did you mean adding a parameter to the method or the class? Do we want to later deprecate the parameter or keep it?

0 reactions
Sycor4x commented, Feb 2, 2022

Somehow I missed the notifications for this. My recollection from several years ago is that I expected transform and inverse_transform to be inverses. To be explicit: if Z = obj.fit_transform(X), then my expectation is that, for instance, np.abs(X - obj.inverse_transform(Z)).mean() is small.

The reason I cared, as far as I can recall, is that if I’m applying a sequence of transformations to X (e.g. a Pipeline) and want to assess how much information is discarded end-to-end, this has a sharp corner: the error from “un-doing” the zero-variance screen can be surprisingly large, and it is large because of how the back-transformation is performed rather than because of anything in the data or the Pipeline.
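
As an illustration of that sharp corner, here is a sketch of such an end-to-end check (the data and pipeline are invented; MaxAbsScaler is chosen because it rescales without centering, so it cannot recover the constant column on the way back):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
# One informative column, one constant column at 10**6.
X = np.column_stack([rng.normal(size=100), np.full(100, 1e6)])

pipe = Pipeline([
    ("scale", MaxAbsScaler()),        # rescales but does not center
    ("select", VarianceThreshold()),  # drops the constant column
])

Z = pipe.fit_transform(X)
X_back = pipe.inverse_transform(Z)  # runs each step's inverse_transform in reverse

# Dominated by the zero-filled constant column (~5e5 on average here).
print(np.abs(X - X_back).mean())
```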

Read more comments on GitHub >

Top Results From Across the Web

sklearn.feature_selection.VarianceThreshold
Feature selector that removes all low-variance features. This feature selection algorithm looks only at the features (X), not the desired outputs (y), and...
Read more >
Dropping Constant Features using VarianceThreshold ...
Variance Threshold is a feature selector that removes all the low variance features from the dataset that are of no great use in...
Read more >
Retain feature names after Scikit Feature Selection
After running a Variance Threshold from Scikit-Learn ...
Read more >
How to Use Variance Thresholding For Robust Feature ...
Go through a hands-on tutorial on Variance Thresholding with Scikit-learn's VarianceThreshold estimator. Same performance even after 50 ...
Read more >
Tutorial 1- Feature Selection-How To Drop ... - YouTube
In this video I am going to start a new playlist on Feature Selection ... about how we can drop constant features using...
Read more >
