
TST test_bagging_regressor/classifier_with_missing_inputs fails with SimpleImputer


See #11480. Git blame shows that we introduced a regression in #11211; pinging the author @jeremiedbb and the reviewers @jnothman @glemaitre @ogrisel @jorisvandenbossche. Below are the logs from test_bagging_regressor_with_missing_inputs:

__________________ test_bagging_regressor_with_missing_inputs __________________
    def test_bagging_regressor_with_missing_inputs():
        # Check that BaggingRegressor can accept X with missing/infinite data
        X = np.array([
            [1, 3, 5],
            [2, None, 6],
            [2, np.nan, 6],
            [2, np.inf, 6],
            [2, np.NINF, 6],
        ])
        y_values = [
            np.array([2, 3, 3, 3, 3]),
            np.array([
                [2, 1, 9],
                [3, 6, 8],
                [3, 6, 8],
                [3, 6, 8],
                [3, 6, 8],
            ])
        ]
        for y in y_values:
            regressor = DecisionTreeRegressor()
            pipeline = make_pipeline(
                SimpleImputer(),
                SimpleImputer(missing_values=np.inf),
                SimpleImputer(missing_values=np.NINF),
                regressor
            )
>           pipeline.fit(X, y).predict(X)
X          = array([[1, 3, 5],
       [2, None, 6],
       [2, nan, 6],
       [2, inf, 6],
       [2, -inf, 6]], dtype=object)
pipeline   = Pipeline(memory=None,
     steps=[('simpleimputer-1', SimpleImputer(copy=True,...tion_leaf=0.0,
           presort=False, random_state=None, splitter='best'))])
regressor  = DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
    ...raction_leaf=0.0,
           presort=False, random_state=None, splitter='best')
y          = array([2, 3, 3, 3, 3])
y_values   = [array([2, 3, 3, 3, 3]), array([[2, 1, 9],
       [3, 6, 8],
       [3, 6, 8],
       [3, 6, 8],
       [3, 6, 8]])]
/home/travis/build/scikit-learn/scikit-learn/sklearn/ensemble/tests/test_bagging.py:785: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/home/travis/build/scikit-learn/scikit-learn/sklearn/pipeline.py:253: in fit
    Xt, fit_params = self._fit(X, y, **fit_params)
/home/travis/build/scikit-learn/scikit-learn/sklearn/pipeline.py:218: in _fit
    **fit_params_steps[name])
/home/travis/build/scikit-learn/scikit-learn/sklearn/externals/_joblib/memory.py:362: in __call__
    return self.func(*args, **kwargs)
/home/travis/build/scikit-learn/scikit-learn/sklearn/pipeline.py:602: in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
/home/travis/build/scikit-learn/scikit-learn/sklearn/base.py:462: in fit_transform
    return self.fit(X, y, **fit_params).transform(X)
/home/travis/build/scikit-learn/scikit-learn/sklearn/impute.py:209: in fit
    X = self._validate_input(X)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
self = SimpleImputer(copy=True, fill_value=None, missing_values=nan, strategy='mean',
       verbose=0)
X = array([[1, 3, 5],
       [2, None, 6],
       [2, nan, 6],
       [2, inf, 6],
       [2, -inf, 6]], dtype=object)
    def _validate_input(self, X):
        allowed_strategies = ["mean", "median", "most_frequent", "constant"]
        if self.strategy not in allowed_strategies:
            raise ValueError("Can only use these strategies: {0} "
                             " got strategy={1}".format(allowed_strategies,
                                                        self.strategy))
    
        if self.strategy in ("most_frequent", "constant"):
            dtype = None
        else:
            dtype = FLOAT_DTYPES
    
        if not is_scalar_nan(self.missing_values):
            force_all_finite = True
        else:
            force_all_finite = "allow-nan"
    
        try:
            X = check_array(X, accept_sparse='csc', dtype=dtype,
                            force_all_finite=force_all_finite, copy=self.copy)
        except ValueError as ve:
            if "could not convert" in str(ve):
                raise ValueError("Cannot use {0} strategy with non-numeric "
                                 "data. Received datatype :{1}."
                                 "".format(self.strategy, X.dtype.kind))
            else:
>               raise ve
E               ValueError: Input contains infinity or a value too large for dtype('float64').
X          = array([[1, 3, 5],
       [2, None, 6],
       [2, nan, 6],
       [2, inf, 6],
       [2, -inf, 6]], dtype=object)
allowed_strategies = ['mean', 'median', 'most_frequent', 'constant']
dtype      = (<type 'numpy.float64'>, <type 'numpy.float32'>, <type 'numpy.float16'>)
force_all_finite = 'allow-nan'
self       = SimpleImputer(copy=True, fill_value=None, missing_values=nan, strategy='mean',
       verbose=0)
ve         = ValueError("Input contains infinity or a value too large for dtype('float64').",)
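The traceback shows why the very first `SimpleImputer` in the pipeline fails: with `missing_values=np.nan`, `_validate_input` calls `check_array` with `force_all_finite="allow-nan"`, a mode that tolerates NaN but still rejects `+/-inf`. Here is a rough sketch of that behavior in plain NumPy (not scikit-learn's actual validation code, just an illustration of the rule that fires):

```python
import numpy as np

def allow_nan_check(X):
    # Sketch of check_array's "allow-nan" mode: NaN passes, inf does not.
    X = np.asarray(X, dtype=float)
    if np.isinf(X).any():
        raise ValueError(
            "Input contains infinity or a value too large for dtype('float64')."
        )
    return X

X = np.array([
    [1, 3, 5],
    [2, np.nan, 6],
    [2, np.inf, 6],
    [2, -np.inf, 6],
], dtype=float)

try:
    allow_nan_check(X)
except ValueError as e:
    print(e)  # same message as in the traceback above
```

So the imputer configured for `np.inf` never gets a chance to run: the NaN-configured imputer ahead of it rejects the infinite values during its own input validation.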

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 15 (15 by maintainers)

Top GitHub Comments

1 reaction
jnothman commented, Jul 12, 2018

I get not wanting to impose restrictions, but this is different to, say, forcing someone’s data have no NaNs in a meta-estimator. Sometimes it’s okay/good to be dogmatic unless shown a need to be lenient. And inf in a feature space should ring alarm bells.

0 reactions
amueller commented, Jul 15, 2018

So is there a PR/consensus on what to do? I don’t think we need to be able to impute np.inf.
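If, per the discussion, `np.inf` does not need to be imputable directly, one possible workaround (a sketch, not the resolution adopted in the issue) is to collapse infinities into NaN before imputing, so a single NaN-configured `SimpleImputer` suffices and the "allow-nan" validation never sees an infinite value:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

X = np.array([
    [1, 3, 5],
    [2, np.nan, 6],
    [2, np.inf, 6],
    [2, -np.inf, 6],
], dtype=float)

# Map +/-inf to NaN first (validate=False skips the finiteness check),
# then fill NaN with the column mean.
pipe = make_pipeline(
    FunctionTransformer(lambda A: np.where(np.isinf(A), np.nan, A),
                        validate=False),
    SimpleImputer(missing_values=np.nan, strategy="mean"),
)
Xt = pipe.fit_transform(X)
# Column 1 had only one finite value (3), so every missing entry becomes 3.
```

This sidesteps the chained-imputer pattern from the failing test entirely, at the cost of treating `inf` and NaN as the same kind of missingness.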


