Design of add_indicator in SimpleImputer may fail when running cross validation

Description

The design of add_indicator depends on missing values existing in the training data. This breaks cross validation whenever a training fold has no missing values in a feature but the corresponding test fold does.

Steps/Code to Reproduce

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import PredefinedSplit
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

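# One feature with a single missing value in the last row.
X = np.array([[1, 2, 3, np.nan]]).T
y = np.array([0, 0, 1, 1])
# Rows 1 and 3 form the test set of fold 1, so that fold trains on
# rows 0 and 2 only, which contain no missing values.
test_fold = np.array([0, 1, 0, 1])

ps = PredefinedSplit(test_fold)
pipe1 = make_pipeline(SimpleImputer(add_indicator=True),
                      LogisticRegression(solver='lbfgs'))

# Fails on fold 1: NaN appears at transform time but not at fit time.
cross_val_score(pipe1, X, y, cv=ps)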

Expected Results

No error is thrown.

Actual Results

ValueError: The features [0] have missing values in transform but have no missing values in fit.

Thoughts

The original design was adopted because, if the training data has no missing values, the indicator would otherwise be a column of all zeros. The same error appears when running a grid search over the add_indicator parameter. One workaround is to split the data so that, for each column that has missing values, missing values occur in both the training set and the test set.
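
The splitting workaround can be sketched as follows. This is only an illustration under assumptions that are not in the original report: a synthetic single-feature dataset, and StratifiedKFold stratified on the missingness mask rather than on y, so that every training fold and every test fold contains at least one missing value.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data (assumption for illustration): 40 rows, one feature,
# balanced labels, and every fifth row set to NaN (8 missing values).
rng = np.random.RandomState(0)
y = np.tile([0, 1], 20)
X = (y + rng.normal(scale=0.5, size=40)).reshape(-1, 1)
X[::5, 0] = np.nan

# Stratify on "is the feature missing?" instead of on the target, so the
# missing rows are spread across all folds.
missing = np.isnan(X[:, 0]).astype(int)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
splits = list(cv.split(X, missing))

pipe = make_pipeline(SimpleImputer(add_indicator=True),
                     LogisticRegression(solver='lbfgs'))

# cross_val_score accepts an iterable of (train, test) index pairs as cv.
scores = cross_val_score(pipe, X, y, cv=splits)
print(scores)
```

With more than one incomplete column, the stratification label would have to encode the missingness pattern across all of those columns, which may not be feasible on small datasets.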

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Reactions: 1
  • Comments: 9 (5 by maintainers)

Top GitHub Comments

amueller commented, Jun 12, 2019 (1 reaction)

@pmattioli for the MissingIndicator that’s constructed in SimpleImputer.

pmattioli commented, Jun 12, 2019 (0 reactions)

"We should have error_on_new=False by default for add_indicator."

@jnothman do you mean we should have error_on_new=False for MissingIndicator?
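
One possible way to try the error_on_new=False idea from user code is to build the indicator columns explicitly instead of relying on add_indicator. The sketch below is an assumption about how this could be wired up with make_union; it is not the change discussed by the maintainers. It reuses the data from the report.

```python
import numpy as np
from sklearn.impute import MissingIndicator, SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import PredefinedSplit, cross_val_score
from sklearn.pipeline import make_pipeline, make_union

# Same data and split as in the report.
X = np.array([[1, 2, 3, np.nan]]).T
y = np.array([0, 0, 1, 1])
ps = PredefinedSplit(np.array([0, 1, 0, 1]))

# Impute and build the indicator side by side. error_on_new=False tells
# MissingIndicator not to raise when a feature has missing values at
# transform time but had none at fit time.
features = make_union(
    SimpleImputer(strategy='mean'),
    MissingIndicator(error_on_new=False),
)
pipe2 = make_pipeline(features, LogisticRegression(solver='lbfgs'))

scores = cross_val_score(pipe2, X, y, cv=ps)
print(scores)
```

When a training fold has no missing values, the MissingIndicator contributes no columns for that fold, so the all-zero column mentioned in the issue is avoided as well.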
