
API options for Pandas output

See original GitHub issue

This issue summarizes all the options for pandas output with a typical data science use case:

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# numeric_features and categorical_features are lists of column names
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_transformer = OneHotEncoder(handle_unknown='ignore')

preprocessor = ColumnTransformer([
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

pipe = Pipeline([('preprocessor', preprocessor), ('classifier', LogisticRegression())])

In all of the following options, pipe[-1].feature_names_in_ is used to get the feature names passed to LogisticRegression. All options require feature_names_in_ to enforce column-name consistency between fit and transform.

Option 1: output kwargs in transform

All transformers will accept an output='pandas' kwarg in transform. To configure transformers to output dataframes during fit:

# passes `output="pandas"` to all steps during `transform`
pipe.fit(X_train_df, transform_output="pandas")

# output of preprocessing in pandas (all steps except the classifier)
pipe[:-1].transform(X_train_df, output="pandas")

Pipeline passes output="pandas" to every step's transform method during fit, so the original pipeline does not need to change. This option requires meta-estimators that contain transformers, such as Pipeline and ColumnTransformer, to forward output="pandas" to every transformer.transform.
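
The forwarding mechanism can be sketched in plain Python. MiniScaler and MiniPipeline below are hypothetical stand-ins, not scikit-learn code; they only illustrate how a meta-estimator would propagate the kwarg to each step's transform (labeled dicts stand in for a pandas DataFrame):

```python
class MiniScaler:
    """Toy transformer: subtracts the per-column minimum."""
    def fit(self, X):
        self.mins_ = [min(col) for col in zip(*X)]
        return self

    def transform(self, X, output="default"):
        rows = [[x - m for x, m in zip(row, self.mins_)] for row in X]
        if output == "pandas":
            # stand-in for wrapping in a DataFrame: label the columns
            # (column names hard-coded here for the two-column example)
            return [dict(zip(("x0", "x1"), row)) for row in rows]
        return rows


class MiniPipeline:
    """Toy meta-estimator that forwards `output=` to every step."""
    def __init__(self, steps):
        self.steps = steps

    def fit_transform(self, X, transform_output="default"):
        for step in self.steps:
            # the key point of Option 1: the meta-estimator forwards
            # the output kwarg to each step's transform
            X = step.fit(X).transform(X, output=transform_output)
        return X


pipe = MiniPipeline([MiniScaler()])
out = pipe.fit_transform([[1, 10], [3, 30]], transform_output="pandas")
print(out)  # [{'x0': 0, 'x1': 0}, {'x0': 2, 'x1': 20}]
```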

Option 2: __init__ parameter

All transformers will accept a transform_output parameter in __init__:

numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median',
                              transform_output="pandas")),
    ('scaler', StandardScaler(transform_output="pandas"))])

categorical_transformer = OneHotEncoder(handle_unknown='ignore', transform_output="pandas")

preprocessor = ColumnTransformer([
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)],
    transform_output="pandas")

pipe = Pipeline([('preprocessor', preprocessor), ('classifier', LogisticRegression())])
          
# All transformers are configured to output dataframes
pipe.fit(X_train_df)

Option 2b: Have a global config for transform_output

For a better user experience, we can have a global config. By default, every transformer's transform_output is set to 'global', meaning it defers to the global configuration.

import sklearn
sklearn.set_config(transform_output="pandas")

pipe = ...
pipe.fit(X_train_df)
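
The 'global'-default mechanism could be implemented roughly as below. The `_config` dict, `set_config`, and MiniTransformer are illustrative stand-ins, not the real scikit-learn machinery:

```python
# module-level configuration, analogous to sklearn's global config
_config = {"transform_output": "default"}


def set_config(**kwargs):
    # global setter, analogous to sklearn.set_config
    _config.update(kwargs)


class MiniTransformer:
    def __init__(self, transform_output="global"):
        # 'global' means: defer to the global configuration
        self.transform_output = transform_output

    def _resolved_output(self):
        if self.transform_output == "global":
            return _config["transform_output"]
        return self.transform_output


set_config(transform_output="pandas")
print(MiniTransformer()._resolved_output())  # pandas
# an explicit per-estimator value still wins over the global config
print(MiniTransformer(transform_output="default")._resolved_output())  # default
```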

Option 3: Use SLEP 006

Have all transformers request output. Similar to Option 1, every transformer needs an output='pandas' kwarg in transform.

numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median').request_for_transform(output=True)),
    ('scaler', StandardScaler().request_for_transform(output=True))])

categorical_transformer = OneHotEncoder(handle_unknown='ignore').request_for_transform(output=True)

preprocessor = (ColumnTransformer([
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])
        .request_for_transform(output=True))

pipe = Pipeline([('preprocessor', preprocessor), ('classifier', LogisticRegression())])
                      
pipe.fit(X_train_df, output="pandas")
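
A rough sketch of the request-based routing that SLEP 006 describes, using hypothetical stand-in classes (the real SLEP 006 API differs): each transformer records whether it wants the output kwarg, and the pipeline routes the kwarg only to steps that requested it.

```python
class RequestingTransformer:
    """Hypothetical transformer that can request the `output` kwarg."""
    def __init__(self):
        self._requests_output = False

    def request_for_transform(self, output=False):
        # record whether this transformer wants `output` routed to it
        self._requests_output = output
        return self

    def transform(self, X, output="default"):
        return {"output": output, "X": X}


class RoutingPipeline:
    """Hypothetical meta-estimator that routes kwargs by request."""
    def __init__(self, steps):
        self.steps = steps

    def transform(self, X, **kwargs):
        for step in self.steps:
            # forward kwargs only to steps that explicitly requested them
            routed = kwargs if getattr(step, "_requests_output", False) else {}
            X = step.transform(X, **routed)
        return X


pipe = RoutingPipeline([RequestingTransformer().request_for_transform(output=True)])
result = pipe.transform([1, 2], output="pandas")
print(result)  # {'output': 'pandas', 'X': [1, 2]}
```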

Option 3b: Have a global config for request

For a better user experience, we can have a global config:

import sklearn
sklearn.set_config(request_for_transform={"output": True})

pipe = ...
pipe.fit(X_train_df, output="pandas")

Summary

Options 2 and 3 are very similar because each requires every transformer to be adjusted, which is not the best API/UX. Options 2b and 3b try to simplify the API with a global config. Overall, I think Option 1 offers the best user experience.

CC: @amueller @ogrisel @glemaitre @adrinjalali @lorentzenchr @jnothman @GaelVaroquaux

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 1
  • Comments: 8 (8 by maintainers)

Top GitHub Comments

1 reaction
thomasjpfan commented, Jun 26, 2021

> From an implementation point of view, option 2b (and maybe option 2) would impose the use of a decorator on all transformers, right? Or we would provide the implementation of a public transform method in TransformerMixin and ask the subclasses to implement a private _transform abstract method. My worry is how to handle the docstring and not break IDE autocomplete based on static code inspection.

@ogrisel Option 2b without the __init__ parameter is very close to my original PR with a global config: https://github.com/scikit-learn/scikit-learn/pull/16772 . I think we decided not to go down the path of having a global config.

As for implementation, I would prefer not to hide it in a mixin and would prefer something like https://github.com/scikit-learn/scikit-learn/pull/20100. The idea is to use self._validate_data to record the column names, and a decorator around transform handles wrapping the output into a pandas dataframe. As an alternative, I can see a more symmetric approach that does not rely on self._validate_data, where we have two decorators: one for fit (record_column_names) and one for transform (wrap_transform).
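
The two-decorator idea can be sketched without pandas; labeled dicts stand in for a DataFrame, and only the decorator names (record_column_names, wrap_transform) come from the comment above — the bodies are illustrative, not the actual implementation:

```python
import functools


def record_column_names(fit):
    """Decorator for fit: remembers the input column names."""
    @functools.wraps(fit)
    def wrapper(self, X):
        # X is a list of row dicts here; a real version would also
        # accept DataFrames and record DataFrame.columns
        self.feature_names_in_ = list(X[0].keys())
        return fit(self, X)
    return wrapper


def wrap_transform(transform):
    """Decorator for transform: relabels raw output with recorded names."""
    @functools.wraps(transform)
    def wrapper(self, X):
        rows = transform(self, X)
        # stand-in for building a DataFrame from the recorded columns
        return [dict(zip(self.feature_names_in_, row)) for row in rows]
    return wrapper


class IdentityTransformer:
    @record_column_names
    def fit(self, X):
        return self

    @wrap_transform
    def transform(self, X):
        # the undecorated transform works on plain rows of values
        return [list(row.values()) for row in X]


t = IdentityTransformer().fit([{"a": 1, "b": 2}])
print(t.transform([{"a": 1, "b": 2}]))  # [{'a': 1, 'b': 2}]
```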

0 reactions
thomasjpfan commented, Sep 26, 2022

I agree, we can close this issue.
