Inconsistent behavior in ColumnTransformer with "all False" vs "empty list"

Describe the workflow you want to enable

I have a selector function I’m using in a ColumnTransformer that’s returning all False, which should be equivalent to an empty list.

But “all False” from the selector function is not treated as equivalent to an empty list of explicit column names.

“all False” should skip the transformer, in the same way an empty list does: https://github.com/scikit-learn/scikit-learn/issues/12071

Describe your proposed solution

Treat a list of all False values as equivalent to an empty list.

Describe alternatives you’ve considered, if relevant

Converting the all-False selection to an empty list myself.
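
The alternative above can be sketched as a small wrapper around the selector. This is a minimal sketch, not part of scikit-learn: `empty_if_all_false` is a hypothetical helper name, and it assumes the selector returns a plain Python list of `bool` values, as `noop_selector` does in the snippet in this issue. A plain dict stands in for the DataFrame here, since iterating over either yields the column names.

```python
def noop_selector(X):
    # Same selector as in the issue: one False per column.
    return [False for col in X]


def empty_if_all_false(selector):
    """Wrap a column selector so an all-False boolean list becomes [].

    Hypothetical helper for illustration; not a scikit-learn API.
    """
    def wrapped(X):
        selection = selector(X)
        is_bool_list = (
            isinstance(selection, (list, tuple))
            and len(selection) > 0
            and all(isinstance(value, bool) for value in selection)
        )
        # Only rewrite pure boolean masks that select nothing;
        # column names, integer positions and slices pass through unchanged.
        if is_bool_list and not any(selection):
            return []
        return selection
    return wrapped


columns = {"col": ["a", "b", "c"]}  # stand-in for the DataFrame's columns
print(empty_if_all_false(noop_selector)(columns))  # prints []
```

Passing `empty_if_all_false(noop_selector)` instead of `noop_selector` as the column specifier should make the transformer receive an explicit empty list, which is the case that is already skipped.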

Additional context

Code snippet:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

def noop_selector(X):
    return [False for col in X]

working_pipeline = ColumnTransformer(
    transformers=[
        ('noop', SimpleImputer(), []),
    ]
)

broken_pipeline = ColumnTransformer(
    transformers=[
        ('noop', SimpleImputer(), noop_selector),
    ]
)

data = pd.DataFrame({'col': ['a', 'b', 'c']})
working_pipeline.fit(data)  # empty list: the imputer is skipped
broken_pipeline.fit(data)   # all-False list: ValueError was raised here

The working pipeline skips the simple imputer. The broken pipeline tries to impute with no data, resulting in ValueError: at least one array or dtype is required.

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 10 (10 by maintainers)

Top GitHub Comments

NicolasHug commented, Jul 2, 2020

Do we want to support “list of boolean” for column selection?

Yes, though instead of adding an extra case, maybe we can merge it with the one above about arrays. I’m not quite sure why _is_empty_column_selection was written the way it is, but hopefully our existing tests will tell us what should/could be done. Feel free to submit a PR with a first version, @zachmayer, and we can iterate from there.
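
One way to read the suggestion above is a single emptiness check that covers boolean arrays and plain lists of booleans. The sketch below is an assumption about what such a check could look like, not the actual body of `_is_empty_column_selection`; `is_empty_selection` is a hypothetical name, and the array branch relies only on duck typing (a `dtype` whose `kind` is `"b"`, as NumPy boolean arrays have).

```python
def is_empty_selection(column):
    """Return True if a column specifier selects no columns.

    Hypothetical illustration only; not scikit-learn's implementation.
    """
    # Boolean mask given as an array-like with a dtype (e.g. a NumPy array).
    if hasattr(column, "dtype") and getattr(column.dtype, "kind", None) == "b":
        return not column.any()
    # Sized containers: empty, or a list made only of bools that are all False.
    if hasattr(column, "__len__"):
        return len(column) == 0 or (
            all(isinstance(item, bool) for item in column) and not any(column)
        )
    # Slices, scalars and other specifiers are never treated as empty here.
    return False


print(is_empty_selection([]))              # True
print(is_empty_selection([False, False]))  # True
print(is_empty_selection([True, False]))   # False
print(is_empty_selection([0]))             # False: integer position 0, not a bool
```

The `isinstance(item, bool)` test matters because `0` would otherwise be falsy and an integer column position could be mistaken for an empty mask.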

