ColumnTransformer should be able to use a function to select columns
The new ColumnTransformer allows the user to specify column names or indices. I think it should be possible to specify a set of columns as a function. Indeed, towards #10603, we should probably have an inbuilt function that distinguishes between categorical and numeric pd.Series dtypes.
In order to support the remainder functionality, I believe the function should have signature: (X,) -> column indices, where column indices is any of the supported column specification formats.
Sound good @amueller, @jorisvandenbossche?
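To make the proposed signature concrete, here is a minimal pure-Python sketch of why a selector of the form (X,) -> column indices fits the remainder functionality: once every callable has been resolved against X, the remainder is simply whatever no transformer claimed. The helper names (resolve_columns, remainder_columns, numeric_columns) and the dict-of-lists stand-in for a DataFrame are assumptions made for illustration. This is not scikit-learn's implementation.

```python
# Hypothetical sketch: illustrates the proposed signature, not scikit-learn internals.

def resolve_columns(spec, X):
    """Turn a column spec into a concrete list of column names.

    A callable spec is called with X, matching the proposed (X,) -> columns signature.
    """
    if callable(spec):
        spec = spec(X)
    if isinstance(spec, str):
        return [spec]
    return list(spec)


def remainder_columns(specs, X):
    """Columns of X that no spec selected, in their original order."""
    used = set()
    for spec in specs:
        used.update(resolve_columns(spec, X))
    return [name for name in X if name not in used]


def numeric_columns(X):
    """Example selector: columns whose values are all int or float (bool excluded)."""
    return [
        name
        for name, values in X.items()
        if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values)
    ]


# X is a plain dict of column name -> values, standing in for a DataFrame.
X = {
    "age": [31, 45],
    "income": [52000.0, 61000.5],
    "city": ["Oslo", "Lima"],
    "member": [True, False],
}

specs = [numeric_columns, "city"]
selected = [resolve_columns(spec, X) for spec in specs]
remainder = remainder_columns(specs, X)

print(selected)   # [['age', 'income'], ['city']]
print(remainder)  # ['member']
```

Because the selector returns an ordinary column specification, the remainder logic does not need to know whether a spec started out as a function or as an explicit list.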
Issue Analytics
- State:
- Created 5 years ago
- Reactions:2
- Comments:21 (21 by maintainers)
I think you should be very careful to avoid any operator overloading. Scikit-learn tends to be quite conservative in this area, and tries to avoid “magic” in its APIs. Don’t worry about defining categorical_features, etc. Worry about ensuring that there’s an interface for users to write custom selectors, so that they don’t need to know the columns by name or index beforehand, as the current interface requires. Later we can worry about building a library of such selectors and how that would look.
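As a rough illustration of such a custom selector, the sketch below assumes a scikit-learn release in which ColumnTransformer accepts a callable as the column specification and calls it with the input X. The selector functions numeric_cols and non_numeric_cols, and the sample data, are hypothetical. Splitting on pandas dtypes is only one possible definition of "categorical", chosen here for brevity.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def numeric_cols(X):
    """Hypothetical selector: names of all numeric-dtype columns in X."""
    return X.select_dtypes(include="number").columns.tolist()


def non_numeric_cols(X):
    """Hypothetical selector: names of all remaining (non-numeric) columns."""
    return X.select_dtypes(exclude="number").columns.tolist()


df = pd.DataFrame(
    {
        "age": [31, 45, 27],
        "income": [52000.0, 61000.5, 48000.0],
        "city": ["Oslo", "Lima", "Oslo"],
    }
)

# The column names never appear in the transformer definition;
# each selector is evaluated against the data passed to fit.
ct = ColumnTransformer(
    [
        ("num", StandardScaler(), numeric_cols),
        ("cat", OneHotEncoder(), non_numeric_cols),
    ]
)

Xt = ct.fit_transform(df)
print(Xt.shape)  # 2 scaled numeric columns + 2 one-hot columns for "city"
```

The same transformer definition can be reused on a frame with different column names, since the selection is deferred until fit time.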
Thanks @partmor, I look forward to your PR. Please start with supporting functions, testing that support, and demonstrating it with a realistic example. Then we can consider simplifying the API for common use cases. @jorisvandenbossche’s proposal does not, for instance, account for disparity in the definition of “categorical feature”.