ColumnTransformer with category_encoders doesn't encode "integer" columns
Describe the bug
I'm not sure whether this should really be considered a bug report for scikit-learn, as the issue may arise from the interaction between ColumnTransformer, category_encoders, and pandas (categorical datatype).
Anyway, I was testing the use of category_encoders with ColumnTransformer and observed inconsistent behavior:
- When integer columns of a df, e.g. “X1”: [1.0, 2.0, 3.0, …], are passed to the ColumnTransformer, it doesn’t run the encoding but returns an np.array of the “same” integers in float64, e.g. [[1.0, 2.0, 3.0, …], …], even if the feature name, e.g. “X1”, is explicitly passed to the transformer. (The feature names/list of columns need to be passed to the category_encoders instantiation, as pointed out by bmreiniger.)
- When integer columns, e.g. “X1”: [1.0, 2.0, 3.0, …], are converted to the pandas “category” datatype and passed to the ColumnTransformer, it returns an np array of nan instead of the encoded array.
- When a df of string/object values is passed instead, ColumnTransformer works and returns the encoded array as expected.
Steps/Code to Reproduce
loading pkgs
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import category_encoders
define a function so the pipeline can be reused
def pipeline_wrapper(X_train, y_train, X_test, y_test, cat_cols):
    encoder = category_encoders.leave_one_out.LeaveOneOutEncoder(cols=cat_cols, verbose=1)
    categorical_transformer = Pipeline(steps=[('cat_encoder', encoder)])
    preprocessor = ColumnTransformer(transformers=[("cat", categorical_transformer, cat_cols),])
    pipe = Pipeline(steps=[('preprocessor', preprocessor),
                           ('classifier', LogisticRegression())])
    pipe.fit(X_train, y_train)
    print("test score: %.3f" % pipe.score(X_test, y_test))
    return pipe
load the house_prices data as an example and set the features and target
housing = fetch_openml(name="house_prices", as_frame=True)
df = housing.data
nona = df.isna().sum(axis=0)
df = df[nona.index[nona==0]].copy()
## set target
df['target'] = (df['SaleCondition'] == "Normal").astype(np.int32)
df['target'].value_counts()
Case 1. (integer) categorical cols
The unexpected results observed here were due to not passing the list of column names to the encoder initialization. I thought that passing it to the ColumnTransformer would also pass it on to the encoder, but that was not the case.
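Case 1 comes down to which columns an encoder picks up when no cols list is given. Here is a minimal sketch, assuming the selection is purely dtype-based (object or categorical), as the comment further down describes. The helper auto_selected_columns and the toy frame are hypothetical and only mimic that check; this is not category_encoders code.

```python
import pandas as pd


def auto_selected_columns(df):
    """Hypothetical helper: names of columns with object or categorical dtype."""
    return [
        col
        for col in df.columns
        if pd.api.types.is_object_dtype(df[col].dtype)
        or isinstance(df[col].dtype, pd.CategoricalDtype)
    ]


# Toy data: X1 holds integer codes, X2 is already a pandas categorical.
toy = pd.DataFrame(
    {"X1": [1, 2, 3, 1], "X2": pd.Categorical(["a", "b", "a", "b"])}
)

# X1 is int64, so a dtype-based check does not pick it up.
print(auto_selected_columns(toy))

# After casting X1 to "category" it is picked up as well.
print(auto_selected_columns(toy.astype({"X1": "category"})))
```

Under that assumption, integer columns would pass through untouched unless they are listed explicitly, which matches what was observed in Case 1.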
Case 2. If we cast the “integer” columns to “category”, we get an array of nan
cat_cols = ['MSSubClass','OverallQual','OverallCond', 'MoSold', 'GarageCars']
for col in cat_cols:
    df[col] = df[col].astype('category')
X_train, X_test, y_train, y_test = train_test_split(df[cat_cols], df['target'], test_size=0.2, random_state=10)
X_train.dtypes # category
p2 = pipeline_wrapper(X_train, y_train, X_test, y_test, cat_cols)
I get the following error message
File "/.../miniconda3/envs/ct38/lib/python3.8/site-packages/sklearn/utils/metaestimators.py", line 113, in <lambda>
out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs) # noqa
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
In fact, we can check that the above error occurred because we were passing an array of nans to the LogisticRegression:
encoder = category_encoders.leave_one_out.LeaveOneOutEncoder
categorical_transformer = Pipeline(steps=[('cat_encoder', encoder())])
preprocessor = ColumnTransformer(transformers=[("cat", categorical_transformer, cat_cols),])
pipe = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', LogisticRegression())])
pipe.fit(X_train, y_train)
pipe[0].transform(X_train)[0:2]
we get an array of nans
array([[nan, nan, nan, nan, nan],
[nan, nan, nan, nan, nan]])
Case 3. If we pass string columns (either as string or as categorical type), we get the categorical-encoded values
cat_cols = ['MSZoning', 'Condition1', 'BldgType', 'RoofStyle']
set(np.hstack(df[cat_cols].values)) # string categorical values
X_train, X_test, y_train, y_test = train_test_split(df[cat_cols], df['target'], test_size=0.2, random_state=10)
X_train.dtypes
p3 = pipeline_wrapper(X_train, y_train, X_test, y_test, cat_cols)
p3[0].transform(X_train)[0:2]
we get expected results
array([[0.87134503, 0.66666667, 0.78378378, 0.82894737],
[0.82881907, 0.828125 , 0.82208589, 0.79565217]])
Expected Results
As seen in the 3rd case, we expect to see an array of float64. Since each categorical value is encoded/“replaced” with a float based on the distribution of the target/response variable associated with that categorical value, we expect to see at most k different float values for k different “levels” of a categorical column.
array([[0.87134503, 0.66666667, 0.78378378, 0.82894737],
[0.82881907, 0.828125 , 0.82208589, 0.79565217]])
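The “at most k different float values” expectation can be sketched with a plain target-mean encoding in pure Python. This is a simplified stand-in for what a target-based encoder does when transforming data without the target, not the LeaveOneOutEncoder implementation; the level names and target values are made up for illustration.

```python
from collections import defaultdict


def fit_target_means(levels, targets):
    """Learn the mean target value for each categorical level."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for level, target in zip(levels, targets):
        sums[level] += target
        counts[level] += 1
    return {level: sums[level] / counts[level] for level in sums}


def encode(levels, means):
    """Replace each level with its learned mean (nan for unseen levels)."""
    return [means.get(level, float("nan")) for level in levels]


# Made-up example: one categorical column with k = 3 levels and a binary target.
levels = ["RL", "RM", "RL", "FV", "RM", "RL"]
targets = [1, 0, 1, 1, 1, 0]

means = fit_target_means(levels, targets)
encoded = encode(levels, means)

print(encoded)
# The number of distinct encoded values never exceeds the number of levels.
print(len(set(encoded)) <= len(set(levels)))
```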
Actual Results
Case 1: np array of integers represented in float64
Case 2: np array of nans
array([[nan, nan, nan, nan, nan],
[nan, nan, nan, nan, nan]])
Versions
System:
python: 3.8.10 (default, Jun 4 2021, 15:09:15) [GCC 7.5.0]
executable: /home/jk486j/miniconda3/envs/ct38/bin/python
machine: Linux-4.11.0-14-generic-x86_64-with-glibc2.17
Python dependencies:
pip: 22.0.4
setuptools: 61.2.0
sklearn: 1.0.1
numpy: 1.21.6
scipy: 1.8.0
Cython: None
pandas: 1.4.0
matplotlib: 3.5.1
joblib: 1.1.0
threadpoolctl: 3.1.0
Built with OpenMP: True
/..../miniconda3/envs/ct38/lib/python3.8/site-packages/_distutils_hack/__init__.py:30: UserWarning: Setuptools is replacing distutils.
warnings.warn("Setuptools is replacing distutils.")
category_encoders.__version__: 2.4.0
pd.__version__: 1.4.0
Top GitHub Comments
I think this is on category_encoders. From their docs, the parameter cols:
Actually, that’s not entirely accurate either; the estimator runs utils.get_obj_cols, which checks for either object or categorical dtype.
So when the data is neither object nor categorical according to pandas, you need to explicitly list the columns; adding cols=cat_cols to your encoder instantiation seems to work as expected.
In your second approach, it looks like they handle transformation of categorical columns by casting them to strings, which probably doesn’t play nicely with the mapping they defined at fit time, whose entries are still numeric.
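To illustrate the suspected mismatch in the second approach, here is a pure-Python sketch. It assumes the fit-time mapping is keyed by the original numeric levels while the values are looked up as strings at transform time. The mapping, its values, and the transform helper are hypothetical, not taken from category_encoders.

```python
import math

# Hypothetical mapping learned at fit time: keys are the numeric levels.
fit_mapping = {20: 0.81, 60: 0.92, 120: 0.77}


def transform(values, mapping):
    """Look each value up in the mapping; anything missing becomes nan."""
    return [mapping.get(value, float("nan")) for value in values]


raw = [20, 60, 120]
as_strings = [str(value) for value in raw]  # what a cast to string would produce

print(transform(raw, fit_mapping))         # numeric keys match the mapping
print(transform(as_strings, fit_mapping))  # "20" != 20, so every lookup misses
print(all(math.isnan(x) for x in transform(as_strings, fit_mapping)))
```

If something like this happens inside the encoder, every lookup misses and the output is all nan, which is consistent with the array shown in Case 2.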
If you add an issue over at category_encoders, please ping me.
closing issue