ColumnTransformer with category_encoders doesn't encode "integer" columns

Describe the bug

I'm not sure whether this should be considered a bug report for scikit-learn, as the issue may arise from the interaction between:

ColumnTransformer
category_encoders
pandas (categorical datatype)

Anyway, I was testing category_encoders with ColumnTransformer and observed inconsistent behavior:

When integer columns of a df, e.g. “X1”: [1.0, 2.0, 3.0, …], are passed to the ColumnTransformer, it doesn’t run the encoding but returns an np.array of the “same” integers in float64, e.g. [[1.0, 2.0, 3.0, …], …], even if the feature name, e.g. “X1”, is explicitly passed to the transformer. Update: as pointed out by bmreiniger, the feature names/list of columns need to be passed to the category_encoders instantiation.
When integer columns, e.g. “X1”: [1.0, 2.0, 3.0, …], are converted to the pandas “category” datatype and passed to ColumnTransformer, it returns an np.array of nan instead of the encoded array.
In contrast, when a df of string/object values is passed, ColumnTransformer works and returns the encoded array as expected.

Steps/Code to Reproduce

Load packages

import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import category_encoders

Define a function so the pipeline can be rerun for each case

def pipeline_wrapper(X_train, y_train, X_test, y_test, cat_cols):
    encoder = category_encoders.leave_one_out.LeaveOneOutEncoder(cols=cat_cols, verbose=1)
    categorical_transformer = Pipeline(steps=[('cat_encoder', encoder)])
    preprocessor = ColumnTransformer(transformers=[("cat", categorical_transformer, cat_cols),])
    pipe = Pipeline(steps=[('preprocessor', preprocessor),
                            ('classifier', LogisticRegression())])
    pipe.fit(X_train, y_train)
    print("test score: %.3f" % pipe.score(X_test, y_test))
    return pipe

Load the house_prices data from OpenML as an example and set the features and target

housing = fetch_openml(name="house_prices", as_frame=True)
df = housing.data
nona = df.isna().sum(axis=0)
df = df[nona.index[nona==0]].copy()
## set target
df['target'] = (df['SaleCondition'] == "Normal").astype(np.int32)
df['target'].value_counts()

~~Case 1. (integer) categorical cols~~ The unexpected results observed here were due to not passing the list of column names to the encoder initialization. I thought passing it to the ColumnTransformer would also pass it to the encoder, but that was not the case.

Case 2. If we cast the “integer” columns to “category”, we get an array of nan

cat_cols = ['MSSubClass','OverallQual','OverallCond', 'MoSold', 'GarageCars']
for col in cat_cols:
    df[col] = df[col].astype('category')
X_train, X_test, y_train, y_test = train_test_split(df[cat_cols], df['target'], test_size=0.2, random_state=10)
X_train.dtypes # category
p2 = pipeline_wrapper(X_train, y_train, X_test, y_test, cat_cols)  # cat_cols is a required argument

I get the following error message

File "/.../miniconda3/envs/ct38/lib/python3.8/site-packages/sklearn/utils/metaestimators.py", line 113, in <lambda>
out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)  # noqa
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

In fact, we can check that the above error occurred because we were passing an array of nans to the logistic regression

encoder = category_encoders.leave_one_out.LeaveOneOutEncoder
categorical_transformer = Pipeline(steps=[('cat_encoder', encoder())])
preprocessor = ColumnTransformer(transformers=[("cat", categorical_transformer, cat_cols),])
pipe = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', LogisticRegression())])
pipe.fit(X_train, y_train)
pipe[0].transform(X_train)[0:2]

We get an array of nans

array([[nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan]])

Case 3. If we pass string columns (either as string or as categorical type), we get the encoded values

cat_cols = ['MSZoning', 'Condition1', 'BldgType', 'RoofStyle']
set(np.hstack(df[cat_cols].values)) # string categorical values
X_train, X_test, y_train, y_test = train_test_split(df[cat_cols], df['target'], test_size=0.2, random_state=10)
X_train.dtypes
p3 = pipeline_wrapper(X_train, y_train, X_test, y_test, cat_cols)  # cat_cols is a required argument
p3[0].transform(X_train)[0:2]

We get the expected results

array([[0.87134503, 0.66666667, 0.78378378, 0.82894737],
       [0.82881907, 0.828125  , 0.82208589, 0.79565217]])

Expected Results

As seen in the third case, we expect to see an array of float64. Since each categorical value is encoded (“replaced”) with a float based on the distribution of the target/response variable associated with that value, we expect to see at most k different float values for k different “levels” of a categorical column.

array([[0.87134503, 0.66666667, 0.78378378, 0.82894737],
       [0.82881907, 0.828125  , 0.82208589, 0.79565217]])
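
To make the "at most k values for k levels" expectation concrete, here is a minimal sketch of plain mean target encoding in pandas. It is not the LeaveOneOutEncoder implementation: leave-one-out encoding additionally excludes each training row's own target while fitting, which this sketch omits. The column name "zone" and all values are made up for illustration.

```python
import pandas as pd

# Toy data (made-up values): one categorical feature and a binary target.
df = pd.DataFrame({
    "zone": ["RL", "RL", "RM", "RM", "RM", "FV"],
    "target": [1, 0, 1, 1, 0, 1],
})

# Mean of the target per level: one float for each category level.
level_means = df.groupby("zone")["target"].mean()

# Replace each level with its target mean.
encoded = df["zone"].map(level_means)

print(encoded.tolist())
print(encoded.nunique(), "distinct values for", df["zone"].nunique(), "levels")
```

Two levels that happen to share the same target mean would collapse to one float, which is why the bound is "at most" k rather than exactly k.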

Actual Results

~~Case 1: np array of integers represented in float64~~

Case 2: np array of nans

array([[nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan]])

Versions

System:
    python: 3.8.10 (default, Jun  4 2021, 15:09:15)  [GCC 7.5.0]
executable: /home/jk486j/miniconda3/envs/ct38/bin/python
   machine: Linux-4.11.0-14-generic-x86_64-with-glibc2.17

Python dependencies:
          pip: 22.0.4
   setuptools: 61.2.0
      sklearn: 1.0.1
        numpy: 1.21.6
        scipy: 1.8.0
       Cython: None
       pandas: 1.4.0
   matplotlib: 3.5.1
       joblib: 1.1.0
threadpoolctl: 3.1.0

Built with OpenMP: True
/..../miniconda3/envs/ct38/lib/python3.8/site-packages/_distutils_hack/__init__.py:30: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")

category_encoders.__version__: 2.4.0
pd.__version__: 1.4.0

Top GitHub Comments

bmreiniger commented, May 5, 2022

I think this is on category_encoders. From their docs, the parameter cols:

a list of columns to encode, if None, all string columns will be encoded.

Actually, that’s not entirely accurate either; the estimator runs utils.get_obj_cols, which checks for either object or categorical dtype.

So when the data is not-object-nor-categorical according to pandas, you need to explicitly list the columns; adding cols=cat_cols to your encoder instantiation seems to work as expected.
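
The dtype check described above can be sketched roughly as follows. The helper object_or_category_columns is a hypothetical stand-in written for this illustration, not the actual utils.get_obj_cols code, and the toy values are made up. It only shows why an integer column is skipped until it is either cast to "category" or listed explicitly in cols.

```python
import pandas as pd

def object_or_category_columns(df):
    """Hypothetical stand-in: list columns with object or categorical dtype."""
    return [
        col for col in df.columns
        if df[col].dtype == object
        or isinstance(df[col].dtype, pd.CategoricalDtype)
    ]

# Toy frame: one integer column and one explicitly object-typed string column.
df = pd.DataFrame({
    "MSSubClass": [20, 60, 20],
    "MSZoning": pd.Series(["RL", "RM", "RL"], dtype=object),
})

# The integer column is not detected, so it would be left unencoded.
before = object_or_category_columns(df)
print(before)

# After casting to "category", the same column is detected.
df["MSSubClass"] = df["MSSubClass"].astype("category")
after = object_or_category_columns(df)
print(after)
```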

In your second approach, it looks like they handle transformation of categorical columns by casting them to strings, which probably doesn’t play nicely with the mapping they defined at fit time, whose entries are still numeric.
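
The suspected mismatch can be illustrated with a small pandas sketch. This only demonstrates the hypothesis in the comment above (string-cast values looked up in a mapping with numeric keys); it is not taken from the category_encoders source, and the mapping values are made up.

```python
import pandas as pd

# Assumed shape of a mapping learned at fit time: keys are numeric levels.
mapping = {20: 0.81, 60: 0.92}  # made-up encoded values

raw = pd.Series([20, 60, 20])

# Looking up the numeric values finds every key.
hits = raw.map(mapping)

# Casting the categorical column to strings turns 20 into "20",
# which matches none of the numeric keys, so every lookup yields NaN.
misses = raw.astype("category").astype(str).map(mapping)

print(hits.tolist())
print(misses.isna().all())
```

If this is indeed the mechanism, it would explain why Case 2 returns only nan while the string columns in Case 3 encode as expected.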

If you add an issue over at category_encoders, please ping me.

jyk4100 commented, May 6, 2022

closing issue