Design of add_indicator in SimpleImputer may fail when running cross validation
Description
The design of add_indicator relies on missing values being present in the training data. This breaks cross validation whenever a fold's training split contains no missing values but its test split does.
Steps/Code to Reproduce
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import PredefinedSplit
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Single feature with one missing value
X = np.array([[1, 2, 3, np.nan]]).T
y = np.array([0, 0, 1, 1])

# Fold 1 trains on the rows without the NaN and tests on the row with it
test_fold = np.array([0, 1, 0, 1])
ps = PredefinedSplit(test_fold)

pipe1 = make_pipeline(SimpleImputer(add_indicator=True),
                      LogisticRegression(solver='lbfgs'))
cross_val_score(pipe1, X, y, cv=ps)
Expected Results
No error is thrown.
Actual Results
ValueError: The features [0] have missing values in transform
but have no missing values in fit.
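The error appears to come from the MissingIndicator that SimpleImputer constructs internally when add_indicator=True: by default, MissingIndicator raises when a feature has missing values at transform time but had none at fit time. A minimal sketch reproducing the same check in isolation (MissingIndicator and its error_on_new parameter are existing scikit-learn APIs; treating this as the exact internal code path is an assumption):

import numpy as np
from sklearn.impute import MissingIndicator

# Fit on a "training fold" whose only column contains no missing values ...
indicator = MissingIndicator()  # error_on_new=True by default
indicator.fit(np.array([[1.0], [3.0]]))

# ... then transform a "test fold" that does contain a NaN in that column.
# This raises the same ValueError as the pipeline above.
indicator.transform(np.array([[2.0], [np.nan]]))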
Thoughts
The original design was adopted because, if the training data has no missing values, the indicator would produce a column of all zeros. The same error will also surface when doing a grid search over the add_indicator parameter. One way to work around this is to split the data so that missing values are present in both the training set and the test set (for each column that has any missing values).
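As a hedged illustration of that workaround, the folds could be stratified on a row-level missingness mask so that rows containing NaNs land on both sides of every split. StratifiedKFold is an existing scikit-learn splitter, but stratifying on a missingness mask and the toy data below are assumptions made for illustration; this also only guarantees per-row, not per-column, coverage.

import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical single-feature data with two rows containing NaN
X_demo = np.array([[1, 2, np.nan, 4, np.nan, 6]]).T

# Stratify on "does this row contain any missing value?"
has_missing = np.isnan(X_demo).any(axis=1).astype(int)
skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X_demo, has_missing):
    # With two NaN rows and two splits, each train and test side gets one NaN row
    print(np.isnan(X_demo[train_idx]).any(), np.isnan(X_demo[test_idx]).any())

Note that stratifying on missingness rather than on the target trades away class-balanced folds, so this is a stop-gap rather than a general fix.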

@pmattioli for the MissingIndicator that’s constructed in SimpleImputer.
@jnothman do you mean we should have error_on_new=False for MissingIndicator?
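For comparison, here is a hedged sketch of what error_on_new=False would buy: instead of relying on add_indicator, the imputer and the indicator can be composed by hand with make_union, telling the indicator not to raise on features that only become missing at transform time. make_union, MissingIndicator and error_on_new are existing scikit-learn APIs, but presenting this composition as equivalent to SimpleImputer's internal add_indicator behaviour is an assumption.

import numpy as np
from sklearn.impute import MissingIndicator, SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import PredefinedSplit, cross_val_score
from sklearn.pipeline import make_pipeline, make_union

X = np.array([[1, 2, 3, np.nan]]).T
y = np.array([0, 0, 1, 1])
ps = PredefinedSplit(np.array([0, 1, 0, 1]))

# Impute and append indicator columns, but tolerate features whose
# missing values only show up in the transform (test) data.
features = make_union(SimpleImputer(),
                      MissingIndicator(error_on_new=False))
pipe2 = make_pipeline(features, LogisticRegression(solver='lbfgs'))
cross_val_score(pipe2, X, y, cv=ps)  # runs without the ValueError above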