[FEATURE] Pandas-to-pandas pipelines
Problem
As far as I know, all scikit-learn transformers support pandas dataframes as their inputs, but most change the output type back into a numpy array. This is frustrating because we lose track of which column means what, for example:
- when applying transformer-only pipelines,
- when the number of columns increases in a pipeline, for instance when computing derived features,
- when columns are dropped based on feature selection.
Examples of existing transformers with this behaviour are sklearn.pipeline.FeatureUnion (!!) and sklearn.preprocessing.StandardScaler.
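A minimal reproduction (the column names and values here are made up purely for illustration):

import pandas as pd
from sklearn.preprocessing import StandardScaler

X = pd.DataFrame({"age": [20.0, 30.0, 40.0], "income": [1.0, 2.0, 3.0]})
result = StandardScaler().fit_transform(X)
print(type(result))  # <class 'numpy.ndarray'> -- the 'age' and 'income' labels are lost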
No existing solution in sklearn
In the most recent version of sklearn (0.22.3) there has been the addition of compose.ColumnTransformer, aiming to replace the ColumnSelector + FeatureUnion pattern. However, I’ve noticed it is not yet stable and does not work in several edge cases. This is based on my own experience; I could not find an existing issue to support this claim.
Also, the ColumnTransformer does not affect transformers that currently output numpy arrays.
There are existing issues asking for similar features in sklearn.
Proposal
Currently, Pipelines already support pandas-to-pandas as long as the intermediate transformers do. Recoding all existing transformers is a lot of work and likely not within the scope of this package, which is why I suggest:
- writing a decorator that turns existing transformers into pandas transformers. Note: I think this concept only works on transformers that do not add new columns.
- only re-implementing FeatureUnion, as it’s such a core component to this workflow.
Example implementation for the proposed decorator
Implementation
import pandas as pd


def pandify(class_type: type, suffix: str = ""):
    """Decorator for having a standard scikit-learn transformer output dataframes.

    A standard transformer is a transformer that outputs the same number of columns
    as it receives as input.
    """
    class PandasTransformer(class_type):
        def transform(self, X: pd.DataFrame) -> pd.DataFrame:
            # Run the original transform, then restore the index and column names.
            result = super().transform(X)
            return pd.DataFrame(
                result, index=X.index, columns=[c + suffix for c in X.columns]
            )

    # For later object introspection, as PandasTransformer is not a helpful class type.
    PandasTransformer.__name__ = class_type.__name__
    PandasTransformer.__doc__ = class_type.__doc__
    return PandasTransformer
Example use
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline(
    [
        ("imputer", pandify(SimpleImputer)(strategy="constant", fill_value=-1)),
        ("scaler", pandify(StandardScaler)()),
    ]
)
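Fitting this on toy data (values made up for illustration) should then preserve the column names end to end:

import pandas as pd

X = pd.DataFrame({"age": [20.0, None, 40.0], "income": [1.0, 2.0, 3.0]})
out = pipeline.fit_transform(X)
print(type(out))          # <class 'pandas.core.frame.DataFrame'>
print(list(out.columns))  # ['age', 'income'] -- the names survive both steps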
Example implementation for the proposed FeatureUnion fix
This implementation subclasses the existing FeatureUnion, overriding the parts that assume numpy arrays.
import numpy as np
import pandas as pd
import sklearn.pipeline
from joblib import Parallel, delayed
from scipy import sparse
# _fit_transform_one and _transform_one are private sklearn helpers
# used by FeatureUnion itself.
from sklearn.pipeline import _fit_transform_one, _transform_one


class PdFeatureUnion(sklearn.pipeline.FeatureUnion):
    """
    Hot-fix on the sklearn.pipeline.FeatureUnion class to support union of dataframes.

    Affected methods are largely copied from the existing implementation.
    """

    def fit_transform(self, X, y=None, **fit_params):
        self._validate_transformers()
        result = Parallel(n_jobs=self.n_jobs)(
            delayed(_fit_transform_one)(trans, X, y, weight, **fit_params)
            for name, trans, weight in self._iter()
        )
        if not result:
            # All transformers are None
            return np.zeros((X.shape[0], 0))
        Xs, transformers = zip(*result)
        self._update_transformer_list(transformers)
        if any(sparse.issparse(f) for f in Xs):
            Xs = sparse.hstack(Xs).tocsr()
        else:
            # Concatenate dataframes side by side instead of np.hstack.
            Xs = self.merge_dataframes_by_column(Xs)
        return Xs

    def merge_dataframes_by_column(self, Xs):
        return pd.concat(Xs, axis="columns", copy=False)

    def transform(self, X):
        Xs = Parallel(n_jobs=self.n_jobs)(
            delayed(_transform_one)(trans, X, None, weight)
            for name, trans, weight in self._iter()
        )
        if not Xs:
            # All transformers are None
            return np.zeros((X.shape[0], 0))
        if any(sparse.issparse(f) for f in Xs):
            Xs = sparse.hstack(Xs).tocsr()
        else:
            Xs = self.merge_dataframes_by_column(Xs)
        return Xs
Example use of the FeatureUnion
Exactly the same as the existing FeatureUnion.
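For instance, reusing the pandify decorator and the toy X from above (the suffixes here are assumed purely to keep the concatenated column names distinct):

union = PdFeatureUnion(
    [
        ("scaled", pandify(StandardScaler, suffix="_scaled")()),
        ("imputed", pandify(SimpleImputer, suffix="_imputed")(strategy="median")),
    ]
)
out = union.fit_transform(X)  # a DataFrame, not a numpy array
print(list(out.columns))
# ['age_scaled', 'income_scaled', 'age_imputed', 'income_imputed']

Without distinct suffixes, both branches would emit identical column names, which pd.concat would keep as duplicates.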
Top GitHub Comments
Hi @tomderuijter. Thanks for the preparation of this issue 😃
It’s interesting you mention this; we’ve been mulling over this issue as well for a while.
A few things jump to mind on my side, but @MBrouns probably also has good things to mention here.
- On the PandasFeatureUnion: I wonder if we need it. If everything that is being concatenated is a dataframe … will scikit-learn still turn it into a numpy array?
- Imagine PCA(2) being applied … what will be the column names that come out after we ensure a dataframe comes out of it? Ideally we’d have something more intelligent than colnames=[0, 1], no? In that case, I think we should wait until get_feature_names is a bit more mature before we explore supporting that here. Designing for it now while it is still early feels like a premature optimisation. (A sketch of this scenario follows after this comment.)
- I would be cool with a DataFrame transformer (it may also really help with some of our internals). In situations where the shape does not change, I think keeping the names the same should suffice. In situations where they do change, we can introduce a suffix like f"{self.__class__}{colnum}", but it might be nice if the user has the ability to overwrite the column names, potentially passed via __init__(self, new_col_names=[...]).
- I wonder about edge cases. Are there situations where the size of the data remains the same but the names should change?

Also … @MBrouns got strong opinions?
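To make the column-naming question above concrete, here is a minimal sketch of what a shape-changing variant of the decorator could look like. The name pandify_any and the generated-name scheme are assumptions for illustration, not an agreed design:

import pandas as pd
from sklearn.decomposition import PCA


def pandify_any(class_type: type):
    """Hypothetical pandify variant for transformers that may change
    the number of columns, such as PCA."""
    class PandasTransformer(class_type):
        def transform(self, X: pd.DataFrame) -> pd.DataFrame:
            result = super().transform(X)
            if result.shape[1] == X.shape[1]:
                # Shape unchanged: keep the input names.
                columns = X.columns
            else:
                # Shape changed: fall back to generated names, e.g. PCA0, PCA1.
                columns = [f"{class_type.__name__}{i}" for i in range(result.shape[1])]
            return pd.DataFrame(result, index=X.index, columns=columns)
    return PandasTransformer


X = pd.DataFrame(
    {"age": [20.0, 30.0, 40.0], "income": [1.0, 2.0, 3.0], "height": [1.6, 1.7, 1.8]}
)
# PCA overrides fit_transform, so fit and transform are called separately
# here to make sure the wrapped transform above is actually used.
out = pandify_any(PCA)(n_components=2).fit(X).transform(X)
print(list(out.columns))  # ['PCA0', 'PCA1']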