Boolean column to Imputer breaks if it is the only categorical column
If a boolean column is the only column passed to a SimpleImputer, or the only non-numeric column passed to an Imputer, fitting fails:
Repro: the following code fails with ValueError: SimpleImputer does not support data with dtype bool. Please provide either a numeric array (with a floating point or integer dtype) or categorical data represented either as an array with integer dtype or an array of string values with an object dtype.
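import numpy as np
import pandas as pd
from evalml.pipelines.components import Imputer

X = pd.DataFrame({
    "bool col with nan": pd.Series([True, np.nan, False, np.nan, True], dtype='bool'),
})
imputer = Imputer()
imputer.fit(X, y)  # y is not defined in this excerpt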
However, as soon as we add another column (object, category, with or without np.nan), the imputer works fine:
X = pd.DataFrame({
    "object with nan": ["b", "b", np.nan, "c", np.nan],
    "bool col with nan": pd.Series([True, np.nan, False, np.nan, True], dtype='bool'),
})
imputer = Imputer()
imputer.fit(X, y)
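One plausible explanation for the difference, not confirmed anywhere in this issue, is the dtype of the array that each frame converts to. The sketch below uses only pandas and NumPy (the frame names and shortened column names are illustrative) and compares the dtype kind of a bool-only frame with that of a frame that also holds an object column:

```python
import numpy as np
import pandas as pd

# A frame whose only column is bool, like the failing case.
bool_only = pd.DataFrame({
    "bool col": pd.Series([True, False, True], dtype="bool"),
})

# The same bool column next to an object column, like the working case.
mixed = pd.DataFrame({
    "object with nan": ["b", np.nan, "c"],
    "bool col": pd.Series([True, False, True], dtype="bool"),
})

# dtype.kind is a one-letter code: "b" for boolean, "O" for object.
bool_only_kind = bool_only.to_numpy().dtype.kind
mixed_kind = mixed.to_numpy().dtype.kind
print(bool_only_kind, mixed_kind)  # b O
```

If the validation step looks at the dtype of the whole converted array, the mixed frame would pass as "string values with an object dtype" while the bool-only frame would be rejected.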
The stack trace is as follows:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-36-10afc6697657> in <module>
23 })
24 imputer = Imputer()
---> 25 imputer.fit(X, y)
26 # lp.fit(X, y)
~/Desktop/evalml/evalml/utils/base_meta.py in _set_fit(self, X, y)
12 @wraps(method)
13 def _set_fit(self, X, y=None):
---> 14 return_value = method(self, X, y)
15 self._is_fitted = True
16 return return_value
~/Desktop/evalml/evalml/pipelines/components/transformers/imputers/imputer.py in fit(self, X, y)
78 if len(X_categorical.columns) > 0:
79 import pdb; pdb.set_trace()
---> 80 self._categorical_imputer.fit(X_categorical, y)
81 self._categorical_cols = X_categorical.columns
82 return self
~/Desktop/evalml/evalml/utils/base_meta.py in _set_fit(self, X, y)
12 @wraps(method)
13 def _set_fit(self, X, y=None):
---> 14 return_value = method(self, X, y)
15 self._is_fitted = True
16 return return_value
~/Desktop/evalml/evalml/pipelines/components/transformers/imputers/simple_imputer.py in fit(self, X, y)
47 X = X.fillna(value=np.nan)
48
---> 49 self._component_obj.fit(X, y)
50 self._all_null_cols = set(X.columns) - set(X.dropna(axis=1, how='all').columns)
51 return self
~/Desktop/venv/lib/python3.7/site-packages/sklearn/impute/_base.py in fit(self, X, y)
266 self : SimpleImputer
267 """
--> 268 X = self._validate_input(X)
269 super()._fit_indicator(X)
270
~/Desktop/venv/lib/python3.7/site-packages/sklearn/impute/_base.py in _validate_input(self, X)
249 "categorical data represented either as an array "
250 "with integer dtype or an array of string values "
--> 251 "with an object dtype.".format(X.dtype))
252
253 return X
ValueError: SimpleImputer does not support data with dtype bool. Please provide either a numeric array (with a floating point or integer dtype) or categorical data represented either as an array with integer dtype or an array of string values with an object dtype.
I'm not quite sure what we can do, given that this looks like a check on the scikit-learn side (in _validate_input). Could we pass in a fake column to appease this check?
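A minimal sketch of the fake-column idea, using pandas only. The helper names and the placeholder column name are hypothetical and are not part of evalml or scikit-learn; the sketch only shows that the padded frame converts to an object array and that the original frame can be recovered afterwards. It does not call the imputer itself.

```python
import pandas as pd

# Hypothetical name for the temporary column; any name unused in X would do.
PLACEHOLDER = "__placeholder__"


def add_placeholder_column(X: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of X with one extra constant column of object dtype."""
    filler = pd.Series("placeholder", index=X.index, dtype=object)
    return X.assign(**{PLACEHOLDER: filler})


def drop_placeholder_column(X: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of X without the temporary column."""
    return X.drop(columns=[PLACEHOLDER])


X = pd.DataFrame({"bool col": pd.Series([True, False, True], dtype="bool")})

padded = add_placeholder_column(X)
padded_kind = padded.to_numpy().dtype.kind   # kind of the array a validator would see
restored = drop_placeholder_column(padded)   # the original columns, unchanged
```

In a real fix the placeholder would have to be dropped from both the fitted column list and the transformed output, which is why this is only an outline of the approach.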
Issue Analytics
- Created 3 years ago
- Comments: 5 (1 by maintainers)
Top GitHub Comments
@christopherbunn got it, yep seems like we forgot about it in Oct, NBD. But yes please do!
This has been sitting for a while. I think I had a PR for this a while back, but I was going to wait for the September release before merging it in. I can rebase and see if it still solves the issue.