Boolean column to Imputer breaks if it is the only categorical column

If only a boolean column is passed to a SimpleImputer, or if a boolean column is the only non-numeric column passed to an Imputer, it breaks:

Repro: the following code fails with "ValueError: SimpleImputer does not support data with dtype bool. Please provide either a numeric array (with a floating point or integer dtype) or categorical data represented either as an array with integer dtype or an array of string values with an object dtype."

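import numpy as np
import pandas as pd
from evalml.pipelines.components import Imputer

X = pd.DataFrame({
    "bool col with nan": pd.Series([True, np.nan, False, np.nan, True], dtype='bool'),
})
y = pd.Series([0, 1, 0, 1, 0])  # placeholder target; the original report does not define y
imputer = Imputer()
imputer.fit(X, y)  # raises the ValueError quoted above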

However, as soon as we add another column (object, category, with or without np.nan), the imputer works fine:

X = pd.DataFrame({
    "object with nan": ["b", "b", np.nan, "c", np.nan],
    "bool col with nan": pd.Series([True, np.nan, False, np.nan, True], dtype='bool'),
})
imputer = Imputer()
imputer.fit(X, y)
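
A likely reason the second frame gets through, offered here as an assumption rather than something confirmed in the issue: when pandas converts a frame to a single NumPy array, a bool-only frame keeps dtype bool, while a frame that mixes bool with object columns is upcast to object, which the error message lists as an accepted dtype. The sketch below uses only pandas and NumPy, with shortened column names and no missing values in the bool column, to show that dtype difference.

```python
import numpy as np
import pandas as pd

# A frame holding only a bool column.
bool_only = pd.DataFrame({
    "bool col": pd.Series([True, False, True], dtype="bool"),
})

# The same bool column next to an object column.
mixed = pd.DataFrame({
    "object col": ["b", np.nan, "c"],
    "bool col": pd.Series([True, False, True], dtype="bool"),
})

# Converting each frame to one NumPy array picks a common dtype for all columns.
bool_only_dtype = bool_only.to_numpy().dtype
mixed_dtype = mixed.to_numpy().dtype

print(bool_only_dtype)  # bool
print(mixed_dtype)      # object
```

If this is indeed what happens inside the imputer, the failure depends on the combined dtype of the categorical columns, not on the bool column by itself.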

The stack trace is as follows:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-36-10afc6697657> in <module>
     23     })
     24 imputer = Imputer()
---> 25 imputer.fit(X, y)
     26 # lp.fit(X, y)

~/Desktop/evalml/evalml/utils/base_meta.py in _set_fit(self, X, y)
     12         @wraps(method)
     13         def _set_fit(self, X, y=None):
---> 14             return_value = method(self, X, y)
     15             self._is_fitted = True
     16             return return_value

~/Desktop/evalml/evalml/pipelines/components/transformers/imputers/imputer.py in fit(self, X, y)
     78         if len(X_categorical.columns) > 0:
     79             import pdb; pdb.set_trace()
---> 80             self._categorical_imputer.fit(X_categorical, y)
     81             self._categorical_cols = X_categorical.columns
     82         return self

~/Desktop/evalml/evalml/utils/base_meta.py in _set_fit(self, X, y)
     12         @wraps(method)
     13         def _set_fit(self, X, y=None):
---> 14             return_value = method(self, X, y)
     15             self._is_fitted = True
     16             return return_value

~/Desktop/evalml/evalml/pipelines/components/transformers/imputers/simple_imputer.py in fit(self, X, y)
     47         X = X.fillna(value=np.nan)
     48 
---> 49         self._component_obj.fit(X, y)
     50         self._all_null_cols = set(X.columns) - set(X.dropna(axis=1, how='all').columns)
     51         return self

~/Desktop/venv/lib/python3.7/site-packages/sklearn/impute/_base.py in fit(self, X, y)
    266         self : SimpleImputer
    267         """
--> 268         X = self._validate_input(X)
    269         super()._fit_indicator(X)
    270 

~/Desktop/venv/lib/python3.7/site-packages/sklearn/impute/_base.py in _validate_input(self, X)
    249                              "categorical data represented either as an array "
    250                              "with integer dtype or an array of string values "
--> 251                              "with an object dtype.".format(X.dtype))
    252 
    253         return X

ValueError: SimpleImputer does not support data with dtype bool. Please provide either a numeric array (with a floating point or integer dtype) or categorical data represented either as an array with integer dtype or an array of string values with an object dtype.

Not quite sure what we can do here, since the error comes from a check on the scikit-learn side (in _validate_input). Could we pass in a fake column to appease this check?
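
An alternative to a fake column, sketched here as an assumption and not as the fix that was eventually merged, would be to cast bool columns to object dtype before they reach the scikit-learn imputer. The helper name cast_bool_columns_to_object is hypothetical and is not part of evalml or scikit-learn.

```python
import pandas as pd


def cast_bool_columns_to_object(X: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of X in which every bool column has object dtype.

    Hypothetical helper for illustration only; not an evalml API.
    """
    bool_cols = X.select_dtypes(include="bool").columns
    # astype with a column-to-dtype mapping leaves the other columns unchanged.
    return X.astype({col: "object" for col in bool_cols})


X = pd.DataFrame({
    "bool col": pd.Series([True, False, True], dtype="bool"),
})
X_cast = cast_bool_columns_to_object(X)

print(X_cast.dtypes["bool col"])  # object
```

Judging by the error message, a frame cast this way should get past the dtype check in _validate_input, although whether the imputed values are then handled correctly downstream would still need to be verified.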

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments:5 (1 by maintainers)

Top GitHub Comments

1 reaction
dsherry commented, Nov 6, 2020

@christopherbunn got it, yep seems like we forgot about it in Oct, NBD. But yes please do!

0 reactions
christopherbunn commented, Nov 6, 2020

This has been sitting for a while. I think I had a PR for this a while back but I was going to wait for the September release before merging in. I can rebase and see if it still solves the issue.
