ColumnTransformer with category_encoders doesn't encode "integer" columns

See original GitHub issue

Describe the bug

I'm not sure whether this should really be considered a bug report for scikit-learn, as the issue may arise from the interaction between

  • ColumnTransformer
  • category_encoders
  • pandas (categorical datatype)

Anyway, I was testing category_encoders with ColumnTransformer and observed inconsistent behavior:

  1. When integer columns of a df, e.g. “X1”: [1.0, 2.0, 3.0, …], are passed to the ColumnTransformer, it doesn’t run the encoding but returns an np.array of the “same” integers in float64, e.g. [[1.0, 2.0, 3.0, …], …], even if the feature name, e.g. “X1”, is explicitly passed to the ColumnTransformer. As pointed out by bmreiniger, the list of column names needs to be passed to the category_encoders encoder when it is instantiated.

  2. When integer columns, e.g. “X1”: [1.0, 2.0, 3.0, …], are converted to the pandas “category” dtype and passed to ColumnTransformer, it returns an np.array of nan instead of the encoded array.

  3. When a df of string/object values is passed instead, ColumnTransformer works and returns the encoded array as expected.

Steps/Code to Reproduce

Loading packages:

import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import category_encoders

Define a function that builds and fits the pipeline, so it can be reused:

def pipeline_wrapper(X_train, y_train, X_test, y_test, cat_cols):
    encoder = category_encoders.leave_one_out.LeaveOneOutEncoder(cols=cat_cols, verbose=1)
    categorical_transformer = Pipeline(steps=[('cat_encoder', encoder)])
    preprocessor = ColumnTransformer(transformers=[("cat", categorical_transformer, cat_cols),])
    pipe = Pipeline(steps=[('preprocessor', preprocessor),
                            ('classifier', LogisticRegression())])
    pipe.fit(X_train, y_train)
    print("test score: %.3f" % pipe.score(X_test, y_test))
    return pipe

Load the house_prices data from OpenML as an example, and set the features and target:

housing = fetch_openml(name="house_prices", as_frame=True)
df = housing.data
nona = df.isna().sum(axis=0)
df = df[nona.index[nona==0]].copy()
## set target
df['target'] = (df['SaleCondition'] == "Normal").astype(np.int32)
df['target'].value_counts()

Case 1. (integer) categorical cols: the unexpected results observed here were due to not passing the list of column names to the encoder initialization. I thought passing it to the ColumnTransformer would also pass it to the encoder, but that was not the case.
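The point of Case 1 can be sketched with a small stand-in transformer. This is only an illustration: RecordingTransformer and the toy df below are hypothetical and not part of category_encoders or the original report. It suggests that ColumnTransformer hands the selected columns to the inner transformer as data, but leaves the transformer's own cols parameter untouched.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer


class RecordingTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for an encoder that has a `cols` parameter."""

    def __init__(self, cols=None):
        self.cols = cols

    def fit(self, X, y=None):
        # Record which columns ColumnTransformer actually handed over.
        self.seen_columns_ = list(X.columns)
        return self

    def transform(self, X):
        # No encoding at all: just return the values as floats.
        return X.to_numpy(dtype=float)


toy_df = pd.DataFrame({"X1": [1, 2, 3], "X2": [4, 5, 6], "X3": [7, 8, 9]})
ct = ColumnTransformer(
    transformers=[("cat", RecordingTransformer(), ["X1", "X2"])]
)
out = ct.fit_transform(toy_df)
fitted = ct.named_transformers_["cat"]

print(fitted.seen_columns_)  # the columns selected by ColumnTransformer
print(fitted.cols)           # the constructor argument, never set by ColumnTransformer
```

So if an encoder decides what to encode from its own cols argument (or from dtypes when cols is None), the column list given to ColumnTransformer does not change that decision.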

Case 2. If we cast the “integer” columns to “category”, we get an array of nan.

cat_cols = ['MSSubClass','OverallQual','OverallCond', 'MoSold', 'GarageCars']
for col in cat_cols:
    df[col] = df[col].astype('category')
X_train, X_test, y_train, y_test = train_test_split(df[cat_cols], df['target'], test_size=0.2, random_state=10)
X_train.dtypes # category
p2 = pipeline_wrapper(X_train, y_train, X_test, y_test, cat_cols)

I get the following error message:

File "/.../miniconda3/envs/ct38/lib/python3.8/site-packages/sklearn/utils/metaestimators.py", line 113, in <lambda>
out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)  # noqa
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

In fact, we can check that the above error occurred because we were passing an array of nans to the logistic regression:

encoder = category_encoders.leave_one_out.LeaveOneOutEncoder
categorical_transformer = Pipeline(steps=[('cat_encoder', encoder())])
preprocessor = ColumnTransformer(transformers=[("cat", categorical_transformer, cat_cols),])
pipe = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', LogisticRegression())])
pipe.fit(X_train, y_train)
pipe[0].transform(X_train)[0:2]

We get an array of nans:

array([[nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan]])

Case 3. If we pass string columns (either as strings or as the categorical type), we get the categorically encoded values.

cat_cols = ['MSZoning', 'Condition1', 'BldgType', 'RoofStyle']
set(np.hstack(df[cat_cols].values)) # string categorical values
X_train, X_test, y_train, y_test = train_test_split(df[cat_cols], df['target'], test_size=0.2, random_state=10)
X_train.dtypes
p3 = pipeline_wrapper(X_train, y_train, X_test, y_test, cat_cols)
p3[0].transform(X_train)[0:2]

We get the expected results:

array([[0.87134503, 0.66666667, 0.78378378, 0.82894737],
       [0.82881907, 0.828125  , 0.82208589, 0.79565217]])

Expected Results

As seen in the 3rd case, we expect to see an array of float64. Since each categorical value is encoded (“replaced”) with a float based on the distribution of the target/response variable associated with that value, we expect to see at most k different float values for k different “levels” of a categorical column.

array([[0.87134503, 0.66666667, 0.78378378, 0.82894737],
       [0.82881907, 0.828125  , 0.82208589, 0.79565217]])
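The "at most k values for k levels" expectation can be checked on a toy example. The sketch below uses a plain per-level target mean in pandas as a simplified stand-in; it is not LeaveOneOutEncoder itself, and the column names and data are made up for illustration.

```python
import pandas as pd

# Toy data: one categorical column with 3 levels and a binary target.
toy = pd.DataFrame({
    "level": ["a", "a", "b", "b", "b", "c"],
    "target": [1, 0, 1, 1, 0, 1],
})

# Mean of the target for each level (simplified target encoding).
level_means = toy.groupby("level")["target"].mean()

# Replace each level with its mean target value.
encoded = toy["level"].map(level_means)

n_levels = toy["level"].nunique()
n_encoded_values = encoded.nunique()
print(n_levels, n_encoded_values)
```

Each level maps to a single float, so the encoded column cannot contain more distinct values than the original column has levels.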

Actual Results

Case 1: np array of the integers represented in float64

Case 2: np array of nans

array([[nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan]])

Versions

System:
    python: 3.8.10 (default, Jun  4 2021, 15:09:15)  [GCC 7.5.0]
executable: /home/jk486j/miniconda3/envs/ct38/bin/python
   machine: Linux-4.11.0-14-generic-x86_64-with-glibc2.17

Python dependencies:
          pip: 22.0.4
   setuptools: 61.2.0
      sklearn: 1.0.1
        numpy: 1.21.6
        scipy: 1.8.0
       Cython: None
       pandas: 1.4.0
   matplotlib: 3.5.1
       joblib: 1.1.0
threadpoolctl: 3.1.0

Built with OpenMP: True
/..../miniconda3/envs/ct38/lib/python3.8/site-packages/_distutils_hack/__init__.py:30: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")

category_encoders.__version__: 2.4.0
pd.__version__: 1.4.0

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

1 reaction
bmreiniger commented, May 5, 2022

I think this is on category_encoders. From their docs, the parameter cols:

a list of columns to encode, if None, all string columns will be encoded.

Actually, that’s not entirely accurate either; the estimator runs utils.get_obj_cols, which checks for either object or categorical dtype.

So when the data is not-object-nor-categorical according to pandas, you need to explicitly list the columns; adding cols=cat_cols to your encoder instantiation seems to work as expected.
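The dtype check described above can be approximated in plain pandas. This is a rough sketch of the idea only: infer_categorical_columns is a hypothetical helper, not the actual utils.get_obj_cols implementation, and the toy column names are made up.

```python
import pandas as pd


def infer_categorical_columns(df):
    """Return names of columns whose dtype is object or pandas categorical."""
    selected = []
    for name, dtype in df.dtypes.items():
        is_object = pd.api.types.is_object_dtype(dtype)
        is_category = isinstance(dtype, pd.CategoricalDtype)
        if is_object or is_category:
            selected.append(name)
    return selected


toy = pd.DataFrame({
    "as_int": [20, 60, 20],
    "as_category": pd.Series([20, 60, 20]).astype("category"),
    "as_object": pd.Series(["RL", "RM", "RL"], dtype=object),
})

auto_cols = infer_categorical_columns(toy)
print(auto_cols)
```

Under this kind of rule the plain integer column is skipped, which matches the observation that integer columns are only encoded when they are listed explicitly via cols.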

In your second approach, it looks like they handle transformation of categorical columns by casting them to strings, which probably doesn’t play nicely with the mapping they defined at fit time, whose entries are still numeric.
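The suspected mechanism can be reproduced in isolation with pandas. This is only an illustration of a key-type mismatch, assuming a mapping keyed by numbers at fit time; the mapping values are invented and this is not category_encoders' actual code.

```python
import pandas as pd

# Hypothetical mapping learned at fit time, keyed by the numeric levels.
fit_time_mapping = {20: 0.80, 60: 0.90}

col = pd.Series([20, 60, 20]).astype("category")

# Casting the categories to strings turns 20 into "20", which is not a key.
encoded_from_strings = col.astype(str).map(fit_time_mapping)

# Keeping the values numeric finds every key.
encoded_from_numbers = col.astype(int).map(fit_time_mapping)

print(encoded_from_strings.isna().all())
print(encoded_from_numbers.tolist())
```

A lookup of string keys in a numerically keyed mapping misses every entry, which would yield an all-nan column like the one in Case 2.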

If you add an issue over at category_encoders, please ping me.

0 reactions
jyk4100 commented, May 6, 2022

closing issue
