Design of add_indicator in SimpleImputer may fail when running cross validation
Description
The design of add_indicator relies on missing values being present in the training data. This breaks cross validation whenever a fold's training split contains no missing values but its test split does.
Steps/Code to Reproduce
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import PredefinedSplit
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Single feature with one missing value
X = np.array([[1, 2, 3, np.nan]]).T
y = np.array([0, 0, 1, 1])

# Fold 1 trains on the rows without the NaN and tests on the row with it
test_fold = np.array([0, 1, 0, 1])
ps = PredefinedSplit(test_fold)

pipe1 = make_pipeline(SimpleImputer(add_indicator=True),
                      LogisticRegression(solver='lbfgs'))
cross_val_score(pipe1, X, y, cv=ps)
Expected Results
No error is thrown.
Actual Results
ValueError: The features [0] have missing values in transform
but have no missing values in fit.
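The error appears to come from the MissingIndicator that SimpleImputer constructs internally when add_indicator=True: by default, MissingIndicator raises when a feature has missing values at transform time but had none at fit time. A minimal sketch reproducing the same check in isolation (MissingIndicator and its error_on_new parameter are existing scikit-learn APIs; treating this as the exact internal code path is an assumption):

import numpy as np
from sklearn.impute import MissingIndicator

# Fit on a "training fold" whose only column contains no missing values ...
indicator = MissingIndicator()  # error_on_new=True by default
indicator.fit(np.array([[1.0], [3.0]]))

# ... then transform a "test fold" that does contain a NaN in that column.
# This raises the same ValueError as the pipeline above.
indicator.transform(np.array([[2.0], [np.nan]]))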
Thoughts
The original design was adopted because, if the training data has no missing values, the indicator would produce a column of all zeros. The same error will also surface when doing a grid search over the add_indicator parameter. One way to work around this is to split the data so that missing values are present in both the training set and the test set (for each column that has any missing values).
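As a hedged illustration of that workaround, the folds could be stratified on a row-level missingness mask so that rows containing NaNs land on both sides of every split. StratifiedKFold is an existing scikit-learn splitter, but stratifying on a missingness mask and the toy data below are assumptions made for illustration; this also only guarantees per-row, not per-column, coverage.

import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical single-feature data with two rows containing NaN
X_demo = np.array([[1, 2, np.nan, 4, np.nan, 6]]).T

# Stratify on "does this row contain any missing value?"
has_missing = np.isnan(X_demo).any(axis=1).astype(int)
skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X_demo, has_missing):
    # With two NaN rows and two splits, each train and test side gets one NaN row
    print(np.isnan(X_demo[train_idx]).any(), np.isnan(X_demo[test_idx]).any())

Note that stratifying on missingness rather than on the target trades away class-balanced folds, so this is a stop-gap rather than a general fix.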

@pmattioli for the MissingIndicator that’s constructed in SimpleImputer.
@jnothman do you mean we should have error_on_new=False for MissingIndicator?
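For comparison, here is a hedged sketch of what error_on_new=False would buy: instead of relying on add_indicator, the imputer and the indicator can be composed by hand with make_union, telling the indicator not to raise on features that only become missing at transform time. make_union, MissingIndicator and error_on_new are existing scikit-learn APIs, but presenting this composition as equivalent to SimpleImputer's internal add_indicator behaviour is an assumption.

import numpy as np
from sklearn.impute import MissingIndicator, SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import PredefinedSplit, cross_val_score
from sklearn.pipeline import make_pipeline, make_union

X = np.array([[1, 2, 3, np.nan]]).T
y = np.array([0, 0, 1, 1])
ps = PredefinedSplit(np.array([0, 1, 0, 1]))

# Impute and append indicator columns, but tolerate features whose
# missing values only show up in the transform (test) data.
features = make_union(SimpleImputer(),
                      MissingIndicator(error_on_new=False))
pipe2 = make_pipeline(features, LogisticRegression(solver='lbfgs'))
cross_val_score(pipe2, X, y, cv=ps)  # runs without the ValueError above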