Missing features removal with SimpleImputer
Code sample
In the sample code below, a column is removed from the dataset during the pipeline:
>>> from sklearn.impute import SimpleImputer
>>> import numpy as np
>>> imp = SimpleImputer()
>>> imp.fit([[0, np.nan], [1, np.nan]])
>>> imp.transform([[0, np.nan], [1, 1]])
array([[0.],
       [1.]])
Problem description
Currently, sklearn.impute.SimpleImputer silently removes features that are np.nan in every training sample.
This can break later pipeline steps because the dataset's shape has changed, e.g.:
dataset[:, columns_to_impute_with_median] = imp.fit_transform(dataset[:, columns_to_impute_with_median])
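The failure mode of that assignment can be reproduced with a small sketch. The toy data and the column selection below are made up for illustration; the only assumption about scikit-learn is the default behaviour described in this issue, where a feature that is entirely np.nan at fit time is dropped from the output.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (made up for illustration): column 2 is NaN in every row.
dataset = np.array([
    [0.0, 1.0, np.nan, 5.0],
    [1.0, np.nan, np.nan, 6.0],
    [2.0, 3.0, np.nan, np.nan],
])
columns_to_impute_with_median = [1, 2, 3]  # hypothetical selection

imp = SimpleImputer(strategy="median")
imputed = imp.fit_transform(dataset[:, columns_to_impute_with_median])

# Three columns went in, but the all-NaN one is dropped on the way out.
print(imputed.shape)

try:
    dataset[:, columns_to_impute_with_median] = imputed
    assignment_error = None
except ValueError as exc:
    # NumPy cannot broadcast a (3, 2) result into a (3, 3) slot.
    assignment_error = exc
    print("assignment failed:", exc)
```

Note that the error is not guaranteed: if only one column survives imputation, NumPy broadcasts it across all target columns and the assignment succeeds with wrong data.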
Possible solutions
For the problematic features, either keep their values if they are valid or impute the fill_value during transform. I suggest adding a new parameter to trigger this behaviour, with a warning that names the affected features.
I am willing to implement this feature and look forward to any advice.
Issue Analytics
- Created 4 years ago
- Reactions: 6
- Comments: 20 (16 by maintainers)
Top GitHub Comments
I don’t see how this leads to a user-friendly API, and I don’t see why we can’t have an option to uphold both properties A and B by imputing empty columns with a constant value.
I think if we solved the feature names issue this would be a much smaller problem…