API options for Pandas output

Related to:

https://github.com/scikit-learn/scikit-learn/issues/5523 pandas in, pandas out
https://github.com/scikit-learn/scikit-learn/issues/10603 typical data science use case
https://github.com/scikit-learn/scikit-learn/pull/20100 array out in preprocessing
#20110 output dataframes in column transformer

This issue summarizes the options for pandas output, using a typical data science use case:

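from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# numeric_features and categorical_features are lists of column names in X_train_df
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_transformer = OneHotEncoder(handle_unknown='ignore')

preprocessor = ColumnTransformer([
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

pipe = Pipeline([('preprocessor', preprocessor), ('classifier', LogisticRegression())])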

In all of the following options, pipe[-1].feature_names_in_ is used to get the feature names used in LogisticRegression. All options require feature_names_in_ to enforce column name consistency between fit and transform.

Option 1: `output` kwargs in `transform`

All transformers will accept an output='pandas' kwarg in transform. To configure transformers to output dataframes during fit:

# passes `output="pandas"` to all steps during `transform`
pipe.fit(X_train_df, transform_output="pandas")

# output of preprocessing in pandas
pipe[:-1].transform(X_train_df, output="pandas")

Pipeline will pass output="pandas" to every transform method during fit. The original pipeline definition does not need to change. This option requires meta-estimators that contain transformers, such as Pipeline and ColumnTransformer, to pass output="pandas" to every transformer.transform.
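To make the forwarding concrete, here is a rough sketch of the mechanism using toy classes. ToyScaler and ToyPipeline are hypothetical stand-ins, not scikit-learn classes, and the `output` and `transform_output` kwargs only mirror the proposal above:

```python
import numpy as np
import pandas as pd


class ToyScaler:
    """Hypothetical transformer whose transform accepts an `output` kwarg."""

    def fit(self, X):
        # Remember column names when the input is a DataFrame.
        columns = getattr(X, "columns", None)
        self.feature_names_in_ = None if columns is None else list(columns)
        self.mean_ = np.asarray(X, dtype=float).mean(axis=0)
        return self

    def transform(self, X, output="default"):
        result = np.asarray(X, dtype=float) - self.mean_
        if output == "pandas":
            index = getattr(X, "index", None)
            return pd.DataFrame(result, columns=self.feature_names_in_, index=index)
        return result


class ToyPipeline:
    """Hypothetical meta-estimator that forwards `output` to every step."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, transform_output="default"):
        # Each step is fit on the previous step's (possibly DataFrame) output.
        for step in self.steps:
            X = step.fit(X).transform(X, output=transform_output)
        return self

    def transform(self, X, output="default"):
        for step in self.steps:
            X = step.transform(X, output=output)
        return X


X_train_df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})
pipe = ToyPipeline([ToyScaler(), ToyScaler()]).fit(X_train_df, transform_output="pandas")

out = pipe.transform(X_train_df, output="pandas")
print(type(out).__name__, list(out.columns))
```

Because every step receives a DataFrame during fit, every step can record column names, which is the property the real meta-estimators would need to preserve.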

Option 2: `init` parameter

All transformers will accept a transform_output parameter in __init__:

numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median',
                              transform_output="pandas")),
    ('scaler', StandardScaler(transform_output="pandas"))])

categorical_transformer = OneHotEncoder(handle_unknown='ignore', transform_output="pandas")

preprocessor = ColumnTransformer([
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)],
    transform_output="pandas")

pipe = Pipeline([('preprocessor', preprocessor), ('classifier', LogisticRegression())])
          
# All transformers are configured to output dataframes
pipe.fit(X_train_df)

Option 2b: Have a global config for `transform_output`

For a better user experience, we can have a global config. By default, transform_output is set to 'global' in all transformers.

import sklearn
sklearn.set_config(transform_output="pandas")

pipe = ...
pipe.fit(X_train_df)
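
One way to read the 'global' default is as a sentinel that defers to the global config at call time. The sketch below uses hypothetical names (_global_config, set_config, ToyTransformer, resolved_output) to illustrate that lookup. It is not the scikit-learn implementation:

```python
# Hypothetical module-level config, standing in for a library-wide setting.
_global_config = {"transform_output": "default"}


def set_config(transform_output):
    """Set the global default used by transformers left at 'global'."""
    _global_config["transform_output"] = transform_output


class ToyTransformer:
    """Hypothetical transformer with a `transform_output` init parameter."""

    def __init__(self, transform_output="global"):
        self.transform_output = transform_output

    def resolved_output(self):
        # 'global' defers to the global config; anything else is an override.
        if self.transform_output == "global":
            return _global_config["transform_output"]
        return self.transform_output


set_config(transform_output="pandas")

follows_global = ToyTransformer()
pinned = ToyTransformer(transform_output="default")

print(follows_global.resolved_output(), pinned.resolved_output())
```

Resolving the setting lazily, rather than in __init__, means a config change made after the transformer is constructed still takes effect.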

Option 3: Use SLEP 006

Have all transformers request output. Similar to Option 1, every transformer needs an output='pandas' kwarg in transform.

numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median').request_for_transform(output=True)),
    ('scaler', StandardScaler().request_for_transform(output=True))])

categorical_transformer = OneHotEncoder(handle_unknown='ignore').request_for_transform(output=True)

preprocessor = (ColumnTransformer([
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])
        .request_for_transform(output=True))

pipe = Pipeline([('preprocessor', preprocessor), ('classifier', LogisticRegression())])
                      
pipe.fit(X_train_df, output="pandas")
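
The request mechanism can be pictured as the meta-estimator passing `output` only to steps that asked for it. The sketch below is a loose illustration with hypothetical names (ToyStep, route). The actual SLEP 006 routing API may look quite different:

```python
class ToyStep:
    """Hypothetical transformer that can opt in to receiving metadata."""

    def __init__(self, name):
        self.name = name
        self.requested = set()
        self.received_ = {}

    def request_for_transform(self, **flags):
        # Record which metadata keys this step wants; return self for chaining.
        self.requested = {key for key, wanted in flags.items() if wanted}
        return self

    def transform(self, X, **metadata):
        # Keep what was received so the routing can be inspected afterwards.
        self.received_ = metadata
        return X


def route(steps, X, **metadata):
    """Pass each step only the metadata keys it requested."""
    for step in steps:
        routed = {k: v for k, v in metadata.items() if k in step.requested}
        X = step.transform(X, **routed)
    return X


imputer = ToyStep("imputer").request_for_transform(output=True)
scaler = ToyStep("scaler")  # never requested `output`

route([imputer, scaler], [[1.0, 2.0]], output="pandas")
print(imputer.received_, scaler.received_)
```

This shows why every transformer in the pipeline above needs its own request_for_transform call: a step that did not opt in never sees the kwarg.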

Option 3b: Have a global config for request

For a better user experience, we can have a global config:

import sklearn
sklearn.set_config(request_for_transform={"output": True})

pipe = ...
pipe.fit(X_train_df, output="pandas")

Summary

Options 2 and 3 are very similar because both require every transformer to be adjusted. This is not the best API/UX. Options 2b and 3b try to simplify the API with a global config. Overall, I think Option 1 has the best user experience.

CC: @amueller @ogrisel @glemaitre @adrinjalali @lorentzenchr @jnothman @GaelVaroquaux

Top GitHub Comments

thomasjpfan commented, Jun 26, 2021

From an implementation point of view, option 2b (and maybe option 2) would impose the use of a decorator on all transformers, right? Or we would provide the implementation of a public transform method in TransformerMixin and ask the subclasses to implement a private _transform abstract method. My worry is how to handle the docstring and not break IDE autocomplete based on static code inspection.

@ogrisel Option 2b without the __init__ parameter is very close to my original PR with a global config: https://github.com/scikit-learn/scikit-learn/pull/16772 . I think we decided not to go down the path of having a global config.

As for implementation, I would prefer not to hide it in a mixin and would prefer something like https://github.com/scikit-learn/scikit-learn/pull/20100. The idea is to use self._validate_data to record the column names, and a decorator around transform to handle wrapping the output into a pandas dataframe. As an alternative, I can see a more symmetric approach that does not rely on self._validate_data, where we have two decorators: record_column_names for fit and wrap_transform for transform.
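
A minimal sketch of what the two-decorator approach could look like. The decorator names come from the comment above, but their bodies and the ToyCenterer class are assumptions made for illustration. The sketch assumes the transformer keeps one output column per input column, and it always wraps the output when fit saw a DataFrame:

```python
import functools

import numpy as np
import pandas as pd


def record_column_names(fit):
    """Decorator for fit: remember the input DataFrame's column names."""

    @functools.wraps(fit)
    def wrapper(self, X, *args, **kwargs):
        columns = getattr(X, "columns", None)
        self.feature_names_in_ = None if columns is None else list(columns)
        return fit(self, X, *args, **kwargs)

    return wrapper


def wrap_transform(transform):
    """Decorator for transform: wrap the array result in a DataFrame."""

    @functools.wraps(transform)
    def wrapper(self, X, *args, **kwargs):
        result = transform(self, X, *args, **kwargs)
        if self.feature_names_in_ is None:
            return result  # fit saw no column names, so keep the array
        index = getattr(X, "index", None)
        return pd.DataFrame(result, columns=self.feature_names_in_, index=index)

    return wrapper


class ToyCenterer:
    """Hypothetical transformer that subtracts the column means."""

    @record_column_names
    def fit(self, X, y=None):
        self.mean_ = np.asarray(X, dtype=float).mean(axis=0)
        return self

    @wrap_transform
    def transform(self, X):
        """Center X using the means learned in fit."""
        return np.asarray(X, dtype=float) - self.mean_


df = pd.DataFrame({"a": [1.0, 3.0], "b": [2.0, 6.0]})
out = ToyCenterer().fit(df).transform(df)
print(list(out.columns))
```

functools.wraps carries the wrapped method's name and docstring over to the wrapper, which may ease part of the docstring concern, although static inspection of the signature by IDEs could still differ.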

thomasjpfan commented, Sep 26, 2022

I agree, we can close this issue.
