Feature Request: Pipelining Outlier Removal
I wonder if we could make outlier removal available in pipelines.
I tried implementing it, for example using `IsolationForest`, but so far I could not solve it, and I know why.
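For illustration, this is roughly the manual version of what I would like a pipeline step to do (the synthetic dataset and the 5% contamination value below are just placeholders):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import Ridge

# Synthetic training data; any supervised dataset would do here.
X_train, y_train = make_regression(n_samples=200, n_features=5, noise=10.0,
                                   random_state=0)

# Fit the outlier detector on the training data only.
iso = IsolationForest(contamination=0.05, random_state=0)
mask = iso.fit_predict(X_train) == 1   # 1 = inlier, -1 = outlier

# Drop the flagged rows from X *and* y, then fit the downstream estimator.
# This joint row removal is what a Pipeline step currently cannot express.
model = Ridge().fit(X_train[mask], y_train[mask])
```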
The problem boils down to `fit_transform` only returning a transformed `X`. This suffices in the vast majority of cases, since we typically only throw away columns (think of a PCA). For outlier removal in a pipeline, however, we need to throw away rows of `X` and `y` during training and do nothing during testing. This is not supported so far. Essentially, we would need to turn the `predict` function into some kind of `transform` function during training.

Investigating the pipeline implementation shows that `fit_transform` is called, if present, during the fitting part of the pipeline, rather than `fit(X, y).transform(X)`. In particular, during cross-validation `fit_transform` is only called on the training split, which would be perfect for outlier removal. What remains is to do nothing in the test step, and for that we can simply implement a “do-nothing” `transform` function.

The most direct way to implement this would, unfortunately, be an API change to the `TransformerMixin` class.
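A minimal sketch of what such a step could look like, built around a hypothetical `OutlierRemover` class using `IsolationForest`: its `fit_transform` drops outlier rows from both `X` and `y` during training, while `transform` passes test data through unchanged. Returning a filtered `y` from `fit_transform` is exactly the API change discussed above, so this does not work with the current `Pipeline`, which only forwards the transformed `X`.

```python
from sklearn.base import BaseEstimator
from sklearn.ensemble import IsolationForest


class OutlierRemover(BaseEstimator):
    """Hypothetical pipeline step that drops outlier rows while fitting."""

    def __init__(self, contamination=0.05):
        self.contamination = contamination

    def fit(self, X, y=None):
        # Learn which samples look like outliers on the training data.
        self.detector_ = IsolationForest(contamination=self.contamination)
        self.detector_.fit(X)
        return self

    def fit_transform(self, X, y=None):
        # Training path: remove the rows flagged as outliers from X and y.
        # Returning a filtered y is the part the current Pipeline API
        # does not support; it only expects the transformed X back.
        self.fit(X, y)
        mask = self.detector_.predict(X) == 1   # 1 = inlier, -1 = outlier
        return (X[mask], y[mask]) if y is not None else X[mask]

    def transform(self, X):
        # Test path: a "do-nothing" transform that passes the data through.
        return X
```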
So my questions are:
Would it be interesting to support outlier removal in pipelines? Are there other, more suitable ideas for implementing this feature in a pipeline?
If the content of this question is somehow inappropriate (e.g. since I’m only an active user, not an active developer of the project) or in the wrong place, feel free to remove the thread.
Issue Analytics
- Created: 6 years ago
- Reactions: 19
- Comments: 29 (25 by maintainers)
Top GitHub Comments
Can anyone provide an example where this was done in practice? Or any paper evaluating the use of automatic outlier removal for supervised learning? My intuition is that it would be a bad idea, and I have never heard of anyone doing it in practice. So any references to papers or applications would be great.