Feature Request: Pipelining Outlier Removal

I wonder if we could make outlier removal available in pipelines.

I tried implementing it, for example using IsolationForest, but so far I couldn’t solve it, and I know why.

The problem boils down to fit_transform only returning a transformed X. This suffices in the vast majority of cases, since we typically only throw away columns (think of a PCA). However, to use outlier removal in a pipeline, we need to throw away rows of X and y during training and do nothing during testing. This is not supported so far. Essentially, we would need to turn the predict function into some kind of transform function during training.
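
To make the row-dropping requirement concrete, here is a minimal sketch of doing it by hand, outside a pipeline. The synthetic data and variable names are assumptions for illustration only; `IsolationForest.fit_predict` returns +1 for inliers and -1 for outliers.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression

# Synthetic data, made up for this illustration
rng = np.random.RandomState(0)
X_train = rng.normal(size=(200, 2))
y_train = (X_train[:, 0] > 0).astype(int)
X_test = rng.normal(size=(50, 2))

# Flag outliers on the training data only (+1 = inlier, -1 = outlier)
detector = IsolationForest(random_state=0)
inlier_mask = detector.fit_predict(X_train) == 1

# Rows have to be dropped from X and y together; a transformer's
# fit_transform can only hand back X, which is the limitation described above
X_clean, y_clean = X_train[inlier_mask], y_train[inlier_mask]

clf = LogisticRegression().fit(X_clean, y_clean)

# At test time nothing is removed: every test row gets a prediction
y_pred = clf.predict(X_test)
```

The mask is applied to both arrays in one step, which is exactly the part a pipeline step cannot express today.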

Investigating the pipeline implementation shows that fit_transform is called, if present, during the fitting part of the pipeline, rather than fit(X, y).transform(X). In particular, during cross-validation fit_transform is only called on the training data. This would be perfect for outlier removal. What remains is to do nothing in the test step, and for that we can simply implement a “do-nothing” transform function.
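
The calling behaviour described here can be checked with a small sketch. `CallRecorder` and the `calls` list are hypothetical names made up for this example; the transformer passes X through unchanged and only records which of its methods the pipeline invokes.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

calls = []  # module-level log of which methods the pipeline invoked


class CallRecorder(BaseEstimator, TransformerMixin):
    """Hypothetical pass-through transformer that logs its method calls."""

    def fit(self, X, y=None):
        calls.append("fit")
        return self

    def fit_transform(self, X, y=None, **fit_params):
        # Overrides the TransformerMixin default so the call is visible
        calls.append("fit_transform")
        return X

    def transform(self, X):
        calls.append("transform")
        return X


rng = np.random.RandomState(0)
X = rng.normal(size=(20, 3))
y = X[:, 0] * 2.0

pipe = Pipeline([("rec", CallRecorder()), ("reg", LinearRegression())])
pipe.fit(X, y)
calls_after_fit = list(calls)

pipe.predict(X)
calls_after_predict = list(calls)

print(calls_after_fit)
print(calls_after_predict)
```

With the scikit-learn versions I have looked at, fitting should log only `fit_transform`, and predicting should add a single `transform`. That is the hook the post proposes to use: do the real work in `fit_transform` and make `transform` a no-op.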

Unfortunately, the most direct way to implement this would be an API change to the TransformerMixin class.

So my questions are:

Would it be interesting to include outlier removal in pipelines? Are there other, more suitable ways of implementing this feature in a pipeline?

If the content of this question is somehow inappropriate (e.g. since I’m only an active user, not an active developer of the project) or in the wrong place, feel free to remove the thread.

Issue Analytics

State:
Created 6 years ago
Reactions: 19
Comments: 29 (25 by maintainers)

Top GitHub Comments

9 reactions

adrinjalali commented, Jun 28, 2019

I was in a project where they were using a random forest to automatically detect the outliers, before fitting a model on the rest of them. In practice it was working pretty well.
I have seen this in places where the high dimensionality of the data doesn’t allow for simple usual outlier removals. For instance, use a one class SVM, remove the outliers, then continue the job.
Not doing it in a pipeline sounds like a bad idea. I always remove my outliers after I split the train/test. In a cross validation/grid search CV scenario, this means I always do that part of it manually, because I can’t have it in the pipeline; I never want to calculate any statistics about the data when the test data is included in it.
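
One way to avoid the manual per-fold step without changing the transformer API is a small wrapper estimator. The sketch below is only an illustration: `OutlierFilteredClassifier` is a hypothetical class, not part of scikit-learn, and the data is synthetic. It drops flagged rows from X and y inside `fit`, and leaves `predict` untouched, so cross-validation only ever removes outliers from the training folds.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


class OutlierFilteredClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical wrapper: drop training outliers, then fit the wrapped model."""

    def __init__(self, detector, estimator):
        self.detector = detector
        self.estimator = estimator

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        # Fit the detector on the training data only (+1 = inlier, -1 = outlier)
        self.detector_ = clone(self.detector)
        mask = self.detector_.fit_predict(X) == 1
        self.n_removed_ = int((~mask).sum())
        # Drop the same rows from X and y before fitting the real model
        self.estimator_ = clone(self.estimator).fit(X[mask], y[mask])
        self.classes_ = self.estimator_.classes_
        return self

    def predict(self, X):
        # No rows are removed at prediction time
        return self.estimator_.predict(X)


# Synthetic data, made up for this illustration
rng = np.random.RandomState(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = OutlierFilteredClassifier(
    detector=IsolationForest(random_state=0),
    estimator=LogisticRegression(),
)

# Each fold refits the detector on that fold's training rows only
scores = cross_val_score(model, X, y, cv=5)

fitted = model.fit(X, y)
full_pred = fitted.predict(X)
```

Because the wrapper is an ordinary estimator, it can also sit as the final step of a Pipeline or inside a grid search. The trade-off is that the outlier step is tied to one downstream model rather than being a free-standing pipeline stage.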

4 reactions

amueller commented, Jun 27, 2019

Can anyone provide an example where this was done in practice? Or any paper evaluating using automatic outlier removal for supervised learning? My intuition is that it would be a bad idea, and I have never heard of anyone doing it in practice. So any references to papers or applications would be great.