Named col indexing fails with ColumnTransformer remainder on changing DataFrame column ordering

Description

I am using ColumnTransformer to prepare (impute etc.) heterogeneous data. I use a DataFrame to have more control over the different (types of) columns by their name.

I had some really cryptic problems when downstream transformers complained of data of the wrong type, while the ColumnTransformer should have divided them up properly.

I found out that ColumnTransformer silently passes the wrong columns along as remainder when

  • specifying columns by name,
  • using the remainder option, and
  • using DataFrames where column ordering can differ between fit and transform

In this case, the wrong columns are passed on to the downstream transformers, as the example demonstrates:

Steps/Code to Reproduce

from sklearn.compose import make_column_transformer
from sklearn.preprocessing import FunctionTransformer
import pandas as pd

def msg(msg):
  def print_cols(X, y=None):
    print(msg, list(X.columns))
    return X
  return print_cols

ct = make_column_transformer(
  (FunctionTransformer(msg('col a'), validate=False), ['a']),
  remainder=FunctionTransformer(msg('remainder'), validate=False)
)

fit_df = pd.DataFrame({
  'a': [2,3], 
  'b': [4,5]})

ct.fit(fit_df)

# prints:
# col a ['a']
# remainder ['b']

transform_df = pd.DataFrame({
  'b': [4,5],  # note that column ordering
  'a': [2,3]}) # is the only difference to fit_df

ct.transform(transform_df)

# prints:
# col a ['a']
# remainder ['a'] <-- Should be ['b']

Expected Results

When using ColumnTransformer with a DataFrame and specifying columns by name, remainder should reference the same columns when fitting and when transforming (['b'] in the example above), regardless of the column positions in the data during fitting and transforming.

Actual Results

During fitting, remainder appears to remember the remaining DataFrame columns by their numeric index, not by their names. If the column ordering of the DataFrame passed to transform differs from that of the fitted DataFrame, the wrong columns are (silently) handled downstream.

Position in module where the remainder index is determined: https://github.com/scikit-learn/scikit-learn/blob/7813f7efb5b2012412888b69e73d76f2df2b50b6/sklearn/compose/_column_transformer.py#L309

My current workaround is to not use the remainder option but specify all columns by name explicitly.

Versions

System:
    python: 3.7.3 (default, Mar 30 2019, 03:44:34) [Clang 9.1.0 (clang-902.0.39.2)]
    executable: /Users/asschude/.local/share/virtualenvs/launchpad-mmWds3ry/bin/python
    machine: Darwin-17.7.0-x86_64-i386-64bit

BLAS:
    macros: NO_ATLAS_INFO=3, HAVE_CBLAS=None
    lib_dirs:
    cblas_libs: cblas

Python deps:
    pip: 19.1.1
    setuptools: 41.0.1
    sklearn: 0.21.2
    numpy: 1.16.4
    scipy: 1.3.0
    Cython: None
    pandas: 0.24.2

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 14 (12 by maintainers)

Top GitHub Comments

jnothman commented, Jan 25, 2020 (3 reactions)

Well, instead of using remainder you can specify a transformation for each feature, and use 'passthrough'?

jnothman commented, Jul 1, 2019 (2 reactions)

I think we would rather start by being conservative (especially as this will inform people that their code previously didn’t work), and consider extending it later to be more permissive.
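
On the user side, a conservative check of this kind could look roughly like the sketch below. The helper `check_same_column_order` is hypothetical and is not scikit-learn's implementation; it only illustrates failing loudly when the column order seen at transform time differs from the order seen at fit time, with an optional realignment by name as the more permissive alternative:

```python
import pandas as pd


def check_same_column_order(fit_columns, X):
    """Raise if X's columns differ in name or order from those seen at fit time."""
    if list(X.columns) != list(fit_columns):
        raise ValueError(
            "Column ordering differs from fit: expected "
            f"{list(fit_columns)}, got {list(X.columns)}"
        )
    return X


fit_df = pd.DataFrame({'a': [2, 3], 'b': [4, 5]})
transform_df = pd.DataFrame({'b': [4, 5], 'a': [2, 3]})  # columns reordered

fit_columns = list(fit_df.columns)  # remember the order at fit time

check_same_column_order(fit_columns, fit_df)  # same order, passes silently

try:
    check_same_column_order(fit_columns, transform_df)
    raised = False
except ValueError:
    raised = True

# Permissive alternative: realign the columns by name before transforming.
realigned = transform_df[fit_columns]
```

Calling the check before ct.transform turns the silent mix-up into an explicit error, while selecting `transform_df[fit_columns]` restores the fit-time order when reordering is known to be harmless.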
