
TST test_bagging_regressor/classifier_with_missing_inputs fails with SimpleImputer


See #11480. Git blame shows that we introduced a regression in #11211; pinging the author @jeremiedbb and the reviewers @jnothman @glemaitre @ogrisel @jorisvandenbossche. Below are the logs from test_bagging_regressor_with_missing_inputs:

__________________ test_bagging_regressor_with_missing_inputs __________________
    def test_bagging_regressor_with_missing_inputs():
        # Check that BaggingRegressor can accept X with missing/infinite data
        X = np.array([
            [1, 3, 5],
            [2, None, 6],
            [2, np.nan, 6],
            [2, np.inf, 6],
            [2, np.NINF, 6],
        ])
        y_values = [
            np.array([2, 3, 3, 3, 3]),
            np.array([
                [2, 1, 9],
                [3, 6, 8],
                [3, 6, 8],
                [3, 6, 8],
                [3, 6, 8],
            ])
        ]
        for y in y_values:
            regressor = DecisionTreeRegressor()
            pipeline = make_pipeline(
                SimpleImputer(),
                SimpleImputer(missing_values=np.inf),
                SimpleImputer(missing_values=np.NINF),
                regressor
            )
>           pipeline.fit(X, y).predict(X)
X          = array([[1, 3, 5],
       [2, None, 6],
       [2, nan, 6],
       [2, inf, 6],
       [2, -inf, 6]], dtype=object)
pipeline   = Pipeline(memory=None,
     steps=[('simpleimputer-1', SimpleImputer(copy=True,...tion_leaf=0.0,
           presort=False, random_state=None, splitter='best'))])
regressor  = DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
    ...raction_leaf=0.0,
           presort=False, random_state=None, splitter='best')
y          = array([2, 3, 3, 3, 3])
y_values   = [array([2, 3, 3, 3, 3]), array([[2, 1, 9],
       [3, 6, 8],
       [3, 6, 8],
       [3, 6, 8],
       [3, 6, 8]])]
/home/travis/build/scikit-learn/scikit-learn/sklearn/ensemble/tests/test_bagging.py:785: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/home/travis/build/scikit-learn/scikit-learn/sklearn/pipeline.py:253: in fit
    Xt, fit_params = self._fit(X, y, **fit_params)
/home/travis/build/scikit-learn/scikit-learn/sklearn/pipeline.py:218: in _fit
    **fit_params_steps[name])
/home/travis/build/scikit-learn/scikit-learn/sklearn/externals/_joblib/memory.py:362: in __call__
    return self.func(*args, **kwargs)
/home/travis/build/scikit-learn/scikit-learn/sklearn/pipeline.py:602: in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
/home/travis/build/scikit-learn/scikit-learn/sklearn/base.py:462: in fit_transform
    return self.fit(X, y, **fit_params).transform(X)
/home/travis/build/scikit-learn/scikit-learn/sklearn/impute.py:209: in fit
    X = self._validate_input(X)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
self = SimpleImputer(copy=True, fill_value=None, missing_values=nan, strategy='mean',
       verbose=0)
X = array([[1, 3, 5],
       [2, None, 6],
       [2, nan, 6],
       [2, inf, 6],
       [2, -inf, 6]], dtype=object)
    def _validate_input(self, X):
        allowed_strategies = ["mean", "median", "most_frequent", "constant"]
        if self.strategy not in allowed_strategies:
            raise ValueError("Can only use these strategies: {0} "
                             " got strategy={1}".format(allowed_strategies,
                                                        self.strategy))
    
        if self.strategy in ("most_frequent", "constant"):
            dtype = None
        else:
            dtype = FLOAT_DTYPES
    
        if not is_scalar_nan(self.missing_values):
            force_all_finite = True
        else:
            force_all_finite = "allow-nan"
    
        try:
            X = check_array(X, accept_sparse='csc', dtype=dtype,
                            force_all_finite=force_all_finite, copy=self.copy)
        except ValueError as ve:
            if "could not convert" in str(ve):
                raise ValueError("Cannot use {0} strategy with non-numeric "
                                 "data. Received datatype :{1}."
                                 "".format(self.strategy, X.dtype.kind))
            else:
>               raise ve
E               ValueError: Input contains infinity or a value too large for dtype('float64').
X          = array([[1, 3, 5],
       [2, None, 6],
       [2, nan, 6],
       [2, inf, 6],
       [2, -inf, 6]], dtype=object)
allowed_strategies = ['mean', 'median', 'most_frequent', 'constant']
dtype      = (<type 'numpy.float64'>, <type 'numpy.float32'>, <type 'numpy.float16'>)
force_all_finite = 'allow-nan'
self       = SimpleImputer(copy=True, fill_value=None, missing_values=nan, strategy='mean',
       verbose=0)
ve         = ValueError("Input contains infinity or a value too large for dtype('float64').",)
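The traceback shows why the very first `SimpleImputer` in the pipeline fails: with `missing_values=np.nan`, `_validate_input` calls `check_array` with `force_all_finite="allow-nan"`, a mode that tolerates NaN but still rejects `+/-inf`. Here is a rough sketch of that behavior in plain NumPy (not scikit-learn's actual validation code, just an illustration of the rule that fires):

```python
import numpy as np

def allow_nan_check(X):
    # Sketch of check_array's "allow-nan" mode: NaN passes, inf does not.
    X = np.asarray(X, dtype=float)
    if np.isinf(X).any():
        raise ValueError(
            "Input contains infinity or a value too large for dtype('float64')."
        )
    return X

X = np.array([
    [1, 3, 5],
    [2, np.nan, 6],
    [2, np.inf, 6],
    [2, -np.inf, 6],
], dtype=float)

try:
    allow_nan_check(X)
except ValueError as e:
    print(e)  # same message as in the traceback above
```

So the imputer configured for `np.inf` never gets a chance to run: the NaN-configured imputer ahead of it rejects the infinite values during its own input validation.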

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 15 (15 by maintainers)

Top GitHub Comments

1 reaction
jnothman commented, Jul 12, 2018

I get not wanting to impose restrictions, but this is different to, say, forcing someone’s data have no NaNs in a meta-estimator. Sometimes it’s okay/good to be dogmatic unless shown a need to be lenient. And inf in a feature space should ring alarm bells.

0 reactions
amueller commented, Jul 15, 2018

So is there a PR/consensus on what to do? I don’t think we need to be able to impute np.inf.
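If, per the discussion, `np.inf` does not need to be imputable directly, one possible workaround (a sketch, not the resolution adopted in the issue) is to collapse infinities into NaN before imputing, so a single NaN-configured `SimpleImputer` suffices and the "allow-nan" validation never sees an infinite value:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

X = np.array([
    [1, 3, 5],
    [2, np.nan, 6],
    [2, np.inf, 6],
    [2, -np.inf, 6],
], dtype=float)

# Map +/-inf to NaN first (validate=False skips the finiteness check),
# then fill NaN with the column mean.
pipe = make_pipeline(
    FunctionTransformer(lambda A: np.where(np.isinf(A), np.nan, A),
                        validate=False),
    SimpleImputer(missing_values=np.nan, strategy="mean"),
)
Xt = pipe.fit_transform(X)
# Column 1 had only one finite value (3), so every missing entry becomes 3.
```

This sidesteps the chained-imputer pattern from the failing test entirely, at the cost of treating `inf` and NaN as the same kind of missingness.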


