API options for Pandas output
Related to:
- https://github.com/scikit-learn/scikit-learn/issues/5523 pandas in, pandas out
- https://github.com/scikit-learn/scikit-learn/issues/10603 typical data science use case
- https://github.com/scikit-learn/scikit-learn/pull/20100 array out in preprocessing
- #20110 output dataframes in column transformer
This issue summarizes all the options for pandas with a normal data science use case:
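from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# numeric_features / categorical_features: lists of column names in X_train_df
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])
categorical_transformer = OneHotEncoder(handle_unknown='ignore')
preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)])
pipe = Pipeline([('preprocessor', preprocessor), ('classifier', LogisticRegression())])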
In all of the following options, pipe[-1].feature_names_in_ is used to get the feature names used in LogisticRegression. All options require feature_names_in_ to enforce column name consistency between fit and transform.
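As a rough illustration of the consistency check, the sketch below uses a toy transformer (not scikit-learn code) that records column names in fit and compares them in transform. The class name ToyTransformer and the dict of column name to list, standing in for a dataframe, are assumptions made to keep the example dependency-free.

```python
class ToyTransformer:
    """Hypothetical transformer; a dict of column name -> list stands in for a dataframe."""

    def fit(self, X):
        # Record the column names seen during fit, in order.
        self.feature_names_in_ = list(X)
        return self

    def transform(self, X):
        # Refuse input whose columns differ from those seen during fit.
        names = list(X)
        if names != self.feature_names_in_:
            raise ValueError(
                f"Column names {names} do not match those seen in fit: "
                f"{self.feature_names_in_}"
            )
        return {name: list(values) for name, values in X.items()}


train = {"age": [31, 45], "income": [40.0, 52.5]}
t = ToyTransformer().fit(train)
same = t.transform({"age": [28], "income": [39.0]})

try:
    t.transform({"income": [39.0], "age": [28]})  # reordered columns
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

Raising on a mismatch is a choice made for this sketch; a real implementation could instead warn or reorder columns.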
Option 1: output kwarg in transform
All transformers will accept an output='pandas' kwarg in transform. To configure transformers to output dataframes during fit:
# passes `output="pandas"` to all steps during `transform`
pipe.fit(X_train_df, transform_output="pandas")
# output of preprocessing in pandas
pipe[:-1].transform(X_train_df, output="pandas")
Pipeline will pass output="pandas" to every transform method during fit. The original pipeline does not need to change. This option requires meta-estimators that contain transformers, such as Pipeline and ColumnTransformer, to pass output="pandas" to every transformer.transform.
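The forwarding that Option 1 asks of meta-estimators can be sketched with toy classes. Everything here is hypothetical: ToyScaler, ToyPipeline, and the as_columns helper are invented for illustration, and a dict of column name to list stands in for a dataframe so the example needs no third-party packages.

```python
def as_columns(X):
    """Return X as a dict of columns, inventing names x0, x1, ... for plain rows."""
    if isinstance(X, dict):
        return X
    return {f"x{i}": list(col) for i, col in enumerate(zip(*X))}


class ToyScaler:
    """Hypothetical transformer that multiplies every value by a constant."""

    def __init__(self, factor):
        self.factor = factor

    def fit(self, X):
        self.feature_names_in_ = list(as_columns(X))
        return self

    def transform(self, X, output="default"):
        cols = as_columns(X)
        scaled = {name: [v * self.factor for v in col] for name, col in cols.items()}
        if output == "pandas":
            # Labelled output: keep the column names (stand-in for a dataframe).
            return scaled
        # Default output: drop the names and return plain rows.
        return [list(row) for row in zip(*scaled.values())]


class ToyPipeline:
    """Hypothetical meta-estimator that forwards the output kwarg to each step."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, transform_output="default"):
        # Every intermediate transform receives the requested output type.
        for step in self.steps[:-1]:
            X = step.fit(X).transform(X, output=transform_output)
        self.steps[-1].fit(X)
        return self

    def transform(self, X, output="default"):
        for step in self.steps:
            X = step.transform(X, output=output)
        return X


X = {"a": [1, 2], "b": [3, 4]}
labelled = ToyPipeline([ToyScaler(2), ToyScaler(10)]).fit(X, transform_output="pandas")
plain = ToyPipeline([ToyScaler(2), ToyScaler(10)]).fit(X)
```

With the kwarg forwarded, the last step sees the original column names; without it, the names are lost after the first step and the last step only sees generated ones.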
Option 2: __init__ parameter
All transformers will accept a transform_output parameter in __init__:
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median',
                              transform_output="pandas")),
    ('scaler', StandardScaler(transform_output="pandas"))])
categorical_transformer = OneHotEncoder(handle_unknown='ignore', transform_output="pandas")
preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)],
    transform_output="pandas")
pipe = Pipeline([('preprocessor', preprocessor), ('classifier', LogisticRegression())])
# All transformers are configured to output dataframes
pipe.fit(X_train_df)
Option 2b: Have a global config for transform_output
For a better user experience, we can have a global config. By default, transform_output is set to 'global' in all transformers.
import sklearn
sklearn.set_config(transform_output="pandas")
pipe = ...
pipe.fit(X_train_df)
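One way the 'global' default could be resolved is sketched below. This is an assumption about the mechanism, not code from the proposal: the module-level _config dict, the local set_config function, and the ConfigurableTransformer class are all hypothetical stand-ins.

```python
# Hypothetical module-level configuration (stand-in for a library-wide setting).
_config = {"transform_output": "default"}


def set_config(transform_output):
    """Update the hypothetical global setting."""
    _config["transform_output"] = transform_output


class ConfigurableTransformer:
    """Hypothetical transformer whose __init__ parameter defaults to 'global'."""

    def __init__(self, transform_output="global"):
        self.transform_output = transform_output

    def resolved_output(self):
        # 'global' defers to the module-level setting at call time;
        # any other value set on the instance takes precedence.
        if self.transform_output == "global":
            return _config["transform_output"]
        return self.transform_output


deferring = ConfigurableTransformer()
pinned = ConfigurableTransformer(transform_output="default")

before = deferring.resolved_output()
set_config(transform_output="pandas")
after = deferring.resolved_output()
pinned_after = pinned.resolved_output()
```

Resolving the value lazily, at call time rather than in __init__, lets a config change made after construction still take effect, while an explicit per-instance value is left untouched.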
Option 3: Use SLEP 006
Have all transformers request output. Similar to Option 1, every transformer needs an output='pandas' kwarg in transform.
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median').request_for_transform(output=True)),
    ('scaler', StandardScaler().request_for_transform(output=True))])
categorical_transformer = OneHotEncoder(handle_unknown='ignore').request_for_transform(output=True)
preprocessor = (ColumnTransformer([
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)])
    .request_for_transform(output=True))
pipe = Pipeline([('preprocessor', preprocessor), ('classifier', LogisticRegression())])
pipe.fit(X_train_df, output="pandas")
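The request-based routing can be illustrated with a toy model. This is only a sketch of the idea that a kwarg reaches the steps that asked for it; it does not reproduce the SLEP 006 machinery, and RequestingTransformer and RoutingPipeline are hypothetical names.

```python
class RequestingTransformer:
    """Hypothetical transformer that can opt in to receiving the output kwarg."""

    def __init__(self):
        self._requests_output = False
        self.received = None

    def request_for_transform(self, output=False):
        # Return self so the request can be chained at construction time.
        self._requests_output = output
        return self

    def transform(self, X, **kwargs):
        # Remember what was routed here so the behaviour can be inspected.
        self.received = kwargs
        return X


class RoutingPipeline:
    """Hypothetical meta-estimator that routes output only where requested."""

    def __init__(self, steps):
        self.steps = steps

    def transform(self, X, **metadata):
        for step in self.steps:
            routed = {}
            if "output" in metadata and step._requests_output:
                routed["output"] = metadata["output"]
            X = step.transform(X, **routed)
        return X


opted_in = RequestingTransformer().request_for_transform(output=True)
opted_out = RequestingTransformer()
result = RoutingPipeline([opted_in, opted_out]).transform([[1, 2]], output="pandas")
```

Here only the step that called request_for_transform(output=True) receives the kwarg, which is why every transformer in the pipeline has to be adjusted under this option.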
Option 3b: Have a global config for request
For a better user experience, we can have a global config:
import sklearn
sklearn.set_config(request_for_transform={"output": True})
pipe = ...
pipe.fit(X_train_df, output="pandas")
Summary
Options 2 and 3 are very similar because both require every transformer to be adjusted. This is not the best API/UX. Options 2b and 3b try to simplify the API with a global config. Overall, I think Option 1 has the best user experience.
CC: @amueller @ogrisel @glemaitre @adrinjalali @lorentzenchr @jnothman @GaelVaroquaux
Top GitHub Comments
@ogrisel Option 2b without the __init__ parameter is very close to my original PR with a global config: https://github.com/scikit-learn/scikit-learn/pull/16772. I think we decided not to go down the path of having a global config.
As for implementation, I would prefer not to hide it in a mixin and prefer something like https://github.com/scikit-learn/scikit-learn/pull/20100. The idea is to use self._validate_data to record the column names, and a decorator around transform to handle wrapping the output into a pandas dataframe. As an alternative, I can see a more symmetric approach that does not rely on self._validate_data, where we have two decorators: record_column_names for fit and wrap_transform for transform.
I agree, we can close this issue.
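The two-decorator alternative mentioned in the comment above could look roughly like the following. Only the decorator names record_column_names and wrap_transform come from the comment; their bodies, the ToyDoubler class, and the use of a dict of column name to list as a stand-in for a dataframe are assumptions. The sketch also assumes the transform keeps one output column per input column.

```python
import functools


def record_column_names(fit):
    """Decorator for fit: remember the input's column names on the estimator."""

    @functools.wraps(fit)
    def wrapper(self, X, *args, **kwargs):
        # Only labelled input (a dict here) carries names to record.
        self.feature_names_in_ = list(X) if isinstance(X, dict) else None
        return fit(self, X, *args, **kwargs)

    return wrapper


def wrap_transform(transform):
    """Decorator for transform: re-attach recorded names to the plain output."""

    @functools.wraps(transform)
    def wrapper(self, X, *args, **kwargs):
        rows = transform(self, X, *args, **kwargs)
        names = getattr(self, "feature_names_in_", None)
        if names is None:
            return rows
        # Turn the list of rows back into named columns.
        return {name: list(col) for name, col in zip(names, zip(*rows))}

    return wrapper


class ToyDoubler:
    """Hypothetical transformer whose core logic only deals with plain rows."""

    @record_column_names
    def fit(self, X):
        return self

    @wrap_transform
    def transform(self, X):
        columns = X.values() if isinstance(X, dict) else zip(*X)
        return [[2 * v for v in row] for row in zip(*columns)]


named = ToyDoubler().fit({"a": [1, 2], "b": [3, 4]})
named_out = named.transform({"a": [1, 2], "b": [3, 4]})

unnamed = ToyDoubler().fit([[1, 3], [2, 4]])
unnamed_out = unnamed.transform([[1, 3], [2, 4]])
```

Keeping the naming logic in the decorators leaves the transformer body unaware of labelled containers, which is the symmetry the comment points at.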