ColumnTransformer should be able to use a function to select columns

The new ColumnTransformer allows the user to specify column names or indices. I think it should be possible to specify a set of columns as a function. Indeed, towards #10603, we should probably have an inbuilt function that distinguishes between categorical and numeric pd.Series dtypes.

In order to support the remainder functionality, I believe the function should have the signature (X,) -> column indices, where “column indices” is any of the supported column specification formats.
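To make the proposed contract concrete, here is a minimal pure-Python sketch, not scikit-learn's implementation. It assumes a toy column-oriented table (a dict mapping column name to a list of values) in place of a DataFrame. The names select_numeric, resolve_columns and remainder are hypothetical helpers invented for this illustration. The point it tries to show is why the callable receives X: once every specification is resolved to positions, the remainder is simply whatever no specification claimed.

```python
from numbers import Number


def select_numeric(X):
    """Hypothetical selector: names of columns whose values are all numbers."""
    return [
        name
        for name, values in X.items()
        if all(isinstance(v, Number) and not isinstance(v, bool) for v in values)
    ]


def resolve_columns(spec, X):
    """Turn a column spec (callable, name(s), or position(s)) into sorted positions."""
    if callable(spec):
        # Signature proposed in the issue: (X,) -> column indices
        spec = spec(X)
    if isinstance(spec, (str, int)):
        spec = [spec]
    names = list(X)
    return sorted(names.index(c) if isinstance(c, str) else c for c in spec)


def remainder(specs, X):
    """Positions of the columns that no spec selected."""
    used = set()
    for spec in specs:
        used.update(resolve_columns(spec, X))
    return [i for i in range(len(X)) if i not in used]


# Toy data standing in for a DataFrame (an assumption of this sketch).
X = {
    "age": [31, 45, 27],
    "city": ["Oslo", "Lima", "Pune"],
    "income": [52000.0, 61000.5, 48000.0],
    "member": ["yes", "no", "yes"],
}

numeric_idx = resolve_columns(select_numeric, X)
remainder_idx = remainder([select_numeric, "city"], X)
print(numeric_idx)    # [0, 2]
print(remainder_idx)  # [3]
```

Because the selector never names a column, the same function keeps working if the table gains or loses columns, which is the use case the issue describes.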

Sound good @amueller, @jorisvandenbossche?

Issue Analytics

State:
Created 5 years ago
Reactions: 2
Comments: 21 (21 by maintainers)

Top GitHub Comments

2 reactions

jnothman commented, Jun 10, 2018

I think you should be very careful to avoid any operator overloading. Scikit-learn tends to be quite conservative in this area, and tries to avoid “magic” in its APIs. Don’t worry about defining categorical_features, etc. Worry about ensuring that there’s an interface for users to write custom selectors that don’t require them to know the columns by name or index beforehand; the current interface does require that. Later we can worry about making a library of such selectors and how that would look.

1 reaction

jnothman commented, Jun 5, 2018

Thanks @partmor, I look forward to your PR. Please start with supporting functions, testing it, and demonstrating with a realistic example. Then we can consider simplifying the API for common use cases. @jorisvandenbossche’s proposal does not, for instance, account for disparity in the definition of “categorical feature”.
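The disparity mentioned above can be illustrated with a small pure-Python sketch. It assumes a toy column-oriented table (a dict of lists) rather than a DataFrame, and both selector functions are hypothetical, written only for this example. Each one encodes a different but defensible definition of “categorical”, and they disagree on the same data, which is one reason to let users pass their own function instead of fixing a single built-in definition.

```python
def categorical_by_type(X):
    """Definition A: a column is categorical if every value is a string."""
    return [
        name
        for name, values in X.items()
        if all(isinstance(v, str) for v in values)
    ]


def categorical_by_cardinality(X, max_unique=3):
    """Definition B: a column is categorical if it has few distinct values."""
    return [
        name
        for name, values in X.items()
        if len(set(values)) <= max_unique
    ]


# Toy data standing in for a DataFrame (an assumption of this sketch).
X = {
    "city": ["Oslo", "Lima", "Pune", "Oslo", "Kyiv"],
    "rating": [1, 3, 3, 2, 1],
    "income": [52000.0, 61000.5, 48000.0, 75000.0, 39000.0],
}

by_type = categorical_by_type(X)
by_cardinality = categorical_by_cardinality(X)
print(by_type)         # ['city']
print(by_cardinality)  # ['rating']
```

Both functions share the (X,) -> column indices shape discussed in the issue, so either could be supplied as a column specification without the transformer needing to know which definition is in use.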