Feature Request: Pipelining Outlier Removal

I wonder if we could make outlier removal available in pipelines.

I tried implementing it, for example using IsolationForest, but so far I could not get it to work, and I know why.

The problem boils down to fit_transform returning only a transformed X. This suffices in the vast majority of cases, since we typically only throw away columns (think of PCA). For outlier removal in a pipeline, however, we need to throw away rows of both X and y during training and do nothing during testing. This is not supported so far. Essentially, we would need to turn the predict function into some kind of transform function during training.
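
To make the limitation concrete, here is a minimal sketch of what happens when a transformer tries to drop rows inside a scikit-learn Pipeline. The class name RowDroppingTransformer, the toy data, and contamination=0.05 are assumptions made for illustration and are not part of the original issue. Because fit_transform can only return X, the final estimator receives a shortened X together with the full-length y.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class RowDroppingTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical transformer that drops outlier rows during training only."""

    def fit(self, X, y=None):
        # contamination=0.05 is an arbitrary choice for this toy example.
        self.detector_ = IsolationForest(contamination=0.05, random_state=0).fit(X)
        return self

    def transform(self, X):
        # "Do-nothing" transform, used at prediction time.
        return X

    def fit_transform(self, X, y=None):
        self.fit(X)
        inliers = self.detector_.predict(X) == 1  # +1 = inlier, -1 = outlier
        # Only X can be returned here; there is no way to hand back a filtered y.
        return X[inliers]


rng = np.random.RandomState(0)
X = rng.normal(size=(100, 2))
X[:5] += 10  # plant a few obvious outliers
y = (X[:, 0] > 0).astype(int)

pipe = Pipeline([("drop", RowDroppingTransformer()), ("clf", LogisticRegression())])

error = None
try:
    pipe.fit(X, y)
except ValueError as exc:
    # The classifier sees fewer rows in X than in y and refuses to fit.
    error = exc

print(type(error).__name__)
```

The failure comes from the final estimator's input validation, which is why the issue argues that a change to the transformer API would be needed.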

Investigating the pipeline implementation shows that fit_transform is called, if present, during the fitting part of the pipeline, rather than fit(X, y).transform(X). In particular, during cross-validation fit_transform is only called on the training data. This would be perfect for outlier removal. What remains is to do nothing in the test step, and to this end we can simply implement a “do-nothing” transform function.
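
The dispatch behaviour described above can be checked with a small sketch. CallRecorder and the module-level calls list are hypothetical helpers introduced only for this illustration; the example assumes a Pipeline without caching (memory=None).

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

calls = []  # records which transformer methods the pipeline invokes


class CallRecorder(TransformerMixin, BaseEstimator):
    """Hypothetical pass-through transformer that logs how it is called."""

    def fit(self, X, y=None):
        calls.append("fit")
        return self

    def transform(self, X):
        calls.append("transform")
        return X

    def fit_transform(self, X, y=None):
        calls.append("fit_transform")
        return X


rng = np.random.RandomState(0)
X = rng.normal(size=(50, 3))
y = (X[:, 0] > 0).astype(int)

pipe = Pipeline([("rec", CallRecorder()), ("clf", LogisticRegression())])
pipe.fit(X, y)    # training path: uses fit_transform on the intermediate step
pipe.predict(X)   # prediction path: uses transform on the intermediate step

print(calls)
```

During fit the intermediate step is reached only through fit_transform, and during predict only through transform, which is the asymmetry the proposal wants to exploit.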

The most direct way to implement this would, unfortunately, be an API change to the TransformerMixin class.

So my questions are:

Would it be interesting to support outlier removal in pipelines? Are there other, more suitable ways of implementing this feature in a pipeline?

If the content of this question is somehow inappropriate (e.g. since I’m only an active user, not an active developer of the project) or in the wrong place, feel free to remove the thread.

Issue Analytics

  • State: open
  • Created 6 years ago
  • Reactions: 19
  • Comments:29 (25 by maintainers)

Top GitHub Comments

9 reactions
adrinjalali commented, Jun 28, 2019
  • I was in a project where they were using a random forest to automatically detect the outliers, before fitting a model on the rest of them. In practice it was working pretty well.
  • I have seen this in places where the high dimensionality of the data doesn’t allow for simple usual outlier removals. For instance, use a one class SVM, remove the outliers, then continue the job.
  • Not doing it in a pipeline sounds like a bad idea. I always remove my outliers after I split the data into train and test sets. In a cross-validation or grid-search CV scenario, this means I always do that part manually, because I can’t have it in the pipeline; I never want to calculate any statistics about the data when the test data is included in it.
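
The manual workflow described in the last point might look roughly like the sketch below. It is only an illustration under assumed choices (synthetic data from make_classification, IsolationForest with contamination=0.05, LogisticRegression as the model), not the commenter's actual code. The detector is fitted on each training fold only, and the test fold is scored untouched.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic data, used purely for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

scores = []
cv = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in cv.split(X):
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]

    # Fit the outlier detector on the training fold only, so no statistics
    # are computed on data that includes the test fold.
    detector = IsolationForest(contamination=0.05, random_state=0).fit(X_train)
    inliers = detector.predict(X_train) == 1  # +1 = inlier, -1 = outlier

    # Drop outlier rows from both X and y, then fit the model on the rest.
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train[inliers], y_train[inliers])

    # The test fold is left as-is: no rows are removed at evaluation time.
    scores.append(model.score(X_test, y_test))

print(np.mean(scores))
```

This is the per-fold bookkeeping that would disappear if a pipeline step could drop rows of X and y during fitting.
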
4 reactions
amueller commented, Jun 27, 2019

Can anyone provide an example where this was done in practice? Or any paper evaluating using automatic outlier removal for supervised learning? My intuition is that it would be a bad idea, and I have never heard of anyone doing it in practice. So any references to papers or applications would be great.
