[FEATURE] Pandas-to-pandas pipelines

Problem

As far as I know, all scikit-learn transformers support pandas dataframes as their inputs, but most change the type back into a numpy array. This is frustrating as we lose track of which column means what, for example:

  • when applying transformer-only pipelines,
  • when the number of columns increases in a pipeline, for instance when computing derived features,
  • when columns are dropped based on feature selection.

Examples of existing transformers with this behaviour are sklearn.pipeline.FeatureUnion (!!) and sklearn.preprocessing.StandardScaler.
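
To make the problem concrete, here is a minimal sketch of the behaviour described above. The column names and values are made up for illustration, and it assumes scikit-learn's default output configuration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Illustrative data; the column names are hypothetical.
df = pd.DataFrame(
    {"age": [20.0, 30.0, 40.0], "income": [1000.0, 2000.0, 3000.0]}
)

# The scaler accepts a DataFrame, but by default it hands back a plain
# numpy array, so the column labels and the index are gone.
scaled = StandardScaler().fit_transform(df)

print(type(scaled).__name__)        # ndarray
print(hasattr(scaled, "columns"))   # False
```

After this step there is no way to tell from `scaled` alone which column held `age` and which held `income`.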

No existing solution in sklearn

In the most recent version of sklearn (0.22.3) there has been the addition of the compose.ColumnTransformer aiming to replace the ColumnSelector + FeatureUnion pattern. However, I’ve noticed it’s not yet stable and does not work in several edge cases. This is based on my own experience and I could not find an existing issue to support this claim.

Also, the ColumnTransformer does not affect transformers that currently output numpy arrays.

Issues asking for similar features in sklearn:

Proposal

Currently, Pipelines already support pandas-to-pandas as long as the intermediate transformers do. Recoding all existing transformers is a lot of work and likely not within the scope of this package, which is why I suggest:

  • writing a decorator that turns existing transformers into pandas transformers. Note: I think this concept only works on transformers that do not add new columns.

  • re-implementing only FeatureUnion, as it is such a core component of this workflow.

Example implementation for the proposed decorator

Implementation

import pandas as pd


def pandify(class_type: type, suffix: str = ""):
    """Decorator for having a standard scikit-learn transformer output dataframes.

    A standard transformer is a transformer that outputs the same number of columns
    as it receives as input.

    """

    class PandasTransformer(class_type):
        def transform(self, X: pd.DataFrame) -> pd.DataFrame:
            result = super().transform(X)
            return pd.DataFrame(
                result, index=X.index, columns=[c + suffix for c in X.columns]
            )

    # For later object introspection, as PandasTransformer is not a helpful class type.
    PandasTransformer.__name__ = class_type.__name__
    PandasTransformer.__doc__ = class_type.__doc__

    return PandasTransformer

Example use

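from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

Pipeline(
    [
        ("imputer", pandify(SimpleImputer)(strategy="constant", fill_value=-1)),
        ("scaler", pandify(StandardScaler)()),
    ]
)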
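
As a rough end-to-end check of the idea, the sketch below repeats the `pandify` decorator from above (without type hints) and runs the pipeline on a small dataframe. The data, column names, and index labels are invented for illustration.

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def pandify(class_type, suffix=""):
    """Same decorator as in the issue, repeated so the sketch is self-contained."""

    class PandasTransformer(class_type):
        def transform(self, X):
            result = super().transform(X)
            return pd.DataFrame(
                result, index=X.index, columns=[c + suffix for c in X.columns]
            )

    PandasTransformer.__name__ = class_type.__name__
    PandasTransformer.__doc__ = class_type.__doc__
    return PandasTransformer


# Illustrative input with a missing value and a non-default index.
df = pd.DataFrame(
    {"a": [1.0, None, 3.0], "b": [10.0, 20.0, 30.0]},
    index=["x", "y", "z"],
)

pipeline = Pipeline(
    [
        ("imputer", pandify(SimpleImputer)(strategy="constant", fill_value=-1)),
        ("scaler", pandify(StandardScaler)()),
    ]
)

out = pipeline.fit_transform(df)

# The result keeps the dataframe type, the column names and the index.
print(type(out).__name__)
print(list(out.columns))
print(list(out.index))
```

Because each step returns a dataframe, the next step also receives one, which is what keeps the labels intact through the whole pipeline.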

Example implementation for the proposed FeatureUnion fix

This implementation subclasses the existing implementation overriding the parts assuming numpy arrays.

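import numpy as np
import pandas as pd
import sklearn.pipeline
from joblib import Parallel, delayed
from scipy import sparse

# Private scikit-learn helpers; their signatures may differ between versions.
from sklearn.pipeline import _fit_transform_one, _transform_one


class PdFeatureUnion(sklearn.pipeline.FeatureUnion):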
    """
    Hot-fix on the sklearn.pipeline.FeatureUnion class to support union of dataframes.
    Affected methods are largely copied from the existing implementation.
    """

    def fit_transform(self, X, y=None, **fit_params):
        self._validate_transformers()
        result = Parallel(n_jobs=self.n_jobs)(
            delayed(_fit_transform_one)(trans, X, y, weight, **fit_params)
            for name, trans, weight in self._iter()
        )

        if not result:
            # All transformers are None
            return np.zeros((X.shape[0], 0))
        Xs, transformers = zip(*result)
        self._update_transformer_list(transformers)
        if any(sparse.issparse(f) for f in Xs):
            Xs = sparse.hstack(Xs).tocsr()
        else:
            Xs = self.merge_dataframes_by_column(Xs)
        return Xs

    def merge_dataframes_by_column(self, Xs):
        return pd.concat(Xs, axis="columns", copy=False)

    def transform(self, X):
        Xs = Parallel(n_jobs=self.n_jobs)(
            delayed(_transform_one)(trans, X, None, weight)
            for name, trans, weight in self._iter()
        )
        if not Xs:
            # All transformers are None
            return np.zeros((X.shape[0], 0))
        if any(sparse.issparse(f) for f in Xs):
            Xs = sparse.hstack(Xs).tocsr()
        else:
            Xs = self.merge_dataframes_by_column(Xs)
        return Xs

Example use of the FeatureUnion

Exactly the same as the existing FeatureUnion.

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Reactions: 4
  • Comments: 9 (5 by maintainers)

Top GitHub Comments

koaning commented, Mar 6, 2020 (2 reactions)

Hi @tomderuijter. Thanks for the preparation of this issue 😃

It’s interesting you mention this, we’ve been mulling over this issue as well for a while.

A few things jump to mind on my side, but @MBrouns probably also has good things to mention here.

  1. I believe scikit-learn at some point wants to adopt the pattern for structure arrays in the long term … at least … I remember Gael talking about this. This does not mean that we can’t have something for pandas here in the meantime, but we do want to keep this in mind.
  2. I wonder if the functional approach is the way forward here. Functionally (get it?) there’s nothing wrong with it, but it is an anti-pattern since scikit-learn (and we) also have a notion of a meta-estimator. The idea is very similar, but you’d have an object performing the operation instead of a function.
  3. I don’t mind the idea of a PandasFeatureUnion but I wonder if we need it. If everything that is being concatenated is a dataframe … will scikit-learn still turn it into a numpy array?
  4. Maybe good to discuss behavior too. Suppose you’d have a PCA(2) being applied … what will be the column names that come out after we ensure a dataframe comes out of it? Ideally we’d have something more intelligent than colnames=[0, 1], no?

koaning commented, Mar 12, 2020 (1 reaction)

In that case, I think we should wait until get_feature_names is a bit more mature before we explore supporting that here. Designing for it now while it is still early feels like a premature optimisation.

I would be cool with a DataFrame transformer (it may also really help with some of our internals). In situations where the shape does not change, I think keeping the names the same should suffice. In situations where they do change, we can introduce a suffix like f"self.__class__{colnum}", but it might be nice if the user has the ability to overwrite the column names, potentially passed via __init__(self, new_col_names=[...]).

I wonder about edge cases. Are there situations where the size of the data remains the same but the names should change?

Also … @MBrouns got strong opinions?
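
The meta-estimator idea raised in these comments could look roughly like the sketch below. The class name `DataFrameWrapper`, the `ClassName_i` naming fallback, and the sample data are all hypothetical, chosen only to illustrate the discussion; `new_col_names` follows the parameter name suggested in the comment. It is not an implementation taken from the thread and it only handles dense output.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


class DataFrameWrapper(BaseEstimator, TransformerMixin):
    """Hypothetical meta-estimator that returns a DataFrame from a wrapped transformer."""

    def __init__(self, transformer, new_col_names=None):
        self.transformer = transformer
        self.new_col_names = new_col_names

    def fit(self, X, y=None):
        # Fit a clone so the transformer passed in by the user stays untouched.
        self.transformer_ = clone(self.transformer).fit(X, y)
        return self

    def transform(self, X):
        result = self.transformer_.transform(X)
        n_out = result.shape[1]
        if self.new_col_names is not None:
            # User-supplied names always win.
            columns = list(self.new_col_names)
        elif n_out == X.shape[1]:
            # Shape unchanged: keep the input column names.
            columns = list(X.columns)
        else:
            # Shape changed: fall back to names derived from the class name.
            prefix = type(self.transformer_).__name__
            columns = [f"{prefix}_{i}" for i in range(n_out)]
        return pd.DataFrame(result, index=X.index, columns=columns)


# Illustrative data.
df = pd.DataFrame(
    {
        "a": [1.0, 2.0, 4.0, 8.0],
        "b": [3.0, 1.0, 4.0, 1.0],
        "c": [2.0, 7.0, 1.0, 8.0],
    }
)

scaled_out = DataFrameWrapper(StandardScaler()).fit_transform(df)
pca_out = DataFrameWrapper(PCA(n_components=2)).fit_transform(df)
named_out = DataFrameWrapper(
    PCA(n_components=2), new_col_names=["pc1", "pc2"]
).fit_transform(df)

print(list(scaled_out.columns))  # ['a', 'b', 'c']
print(list(pca_out.columns))     # ['PCA_0', 'PCA_1']
print(list(named_out.columns))   # ['pc1', 'pc2']
```

Compared with the decorator, an object like this can be cloned and inspected through `get_params` like any other estimator, at the cost of one more level of nesting in the pipeline definition.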
