Implement Extended Isolation Forest
Describe the workflow you want to enable
In the context of anomaly detection, the Isolation Forest algorithm has a bias that makes some data points' anomaly scores lower than they should be. The problem arises in regions of the space that are axis-aligned with a cluster: a point can be very far from a cluster, yet the basic Isolation Forest algorithm may assign it a low anomaly score simply because the point is axis-aligned with the cluster. This leads to false negatives in my application.
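To see the bias in practice, here is a small sketch using today's axis-aligned `IsolationForest` (my own toy setup, not from the issue): both probe points sit at the same distance from an isotropic cluster, but one of them lies in the axis-aligned "ghost" band.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))  # a single isotropic cluster around the origin

clf = IsolationForest(random_state=0).fit(X)

d = 6.0
probes = np.array([
    [d, 0.0],                          # axis-aligned with the cluster ("ghost" band)
    [d / np.sqrt(2), d / np.sqrt(2)],  # same Euclidean distance, off-axis
])

# Higher score_samples means "looks more normal"; per Hariri et al., the
# standard forest tends to rate the axis-aligned probe as less anomalous
# than the off-axis one, even though both are equally far from the data.
print(clf.score_samples(probes))
```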
To overcome this bias, Hariri et al. proposed the Extended Isolation Forest algorithm. While the standard Isolation Forest randomly chooses a single feature and a threshold value to split the points, the extended version splits with a random hyperplane. Because these hyperplanes need not be axis-aligned, they remove the bias of the standard algorithm. The standard algorithm then becomes a special case of the extended one, restricted to axis-aligned hyperplanes.
Please have a look at the original paper; it explains the problem very well.
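To make the difference concrete, here is a minimal NumPy sketch of the two split rules as I read them from the paper (illustrative only, not the reference implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))  # samples reaching a given tree node

# Standard Isolation Forest: pick a random feature and a random threshold
# between its observed min and max, then split along that axis.
feature = rng.integers(X.shape[1])
threshold = rng.uniform(X[:, feature].min(), X[:, feature].max())
left_standard = X[:, feature] <= threshold

# Extended Isolation Forest: draw a random normal vector (slope) and a random
# intercept point inside the node's bounding box, then split with the
# resulting, generally oblique, hyperplane.
normal = rng.normal(size=X.shape[1])
intercept = rng.uniform(X.min(axis=0), X.max(axis=0))
left_extended = (X - intercept) @ normal <= 0
```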
Describe your proposed solution
I had a look at the Isolation Forest code and, in my humble opinion, the simplest solution might be to add an argument to the `IsolationForest` class constructor to choose how the samples should be split in two, maybe something like an `extended` argument.
This would basically modify the `splitter` argument passed to the underlying `base_estimator` (an `ExtraTreeRegressor` instance). We could add a "random_hyperplane" splitting mode, which requires implementing a new `Splitter` class.
Overall, this does not amount to a lot of changes beyond adding the new splitter class. I can work on an implementation if we agree this is a useful addition.
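For illustration, here is roughly the wiring this implies; note that the "random_hyperplane" splitter name (like the `extended` argument) is hypothetical and does not exist in scikit-learn today:

```python
from sklearn.tree import ExtraTreeRegressor

# Roughly what IsolationForest wires up internally today:
# axis-aligned random splits on a single randomly chosen feature.
axis_aligned_tree = ExtraTreeRegressor(max_features=1, splitter="random")

# What IsolationForest(extended=True) could wire up instead, delegating the
# oblique split to a new Splitter class registered under this name
# (hypothetical: fitting this tree would fail on current scikit-learn).
oblique_tree = ExtraTreeRegressor(splitter="random_hyperplane")
```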
Describe alternatives you’ve considered, if relevant
Maybe the Extended Isolation Forest algorithm should be a distinct class, but I doubt it is worth it.
Issue Analytics
- State:
- Created 4 years ago
- Reactions: 8
- Comments: 13 (6 by maintainers)
Top GitHub Comments
@thomasjpfan just to be clear, you’d be open to a random hyperplane splitter, if it could be used for isolation forests, extra trees, and random forests?
Only splitting along an axis is a major limitation of tree-based algorithms, especially for anomaly detection. It’d be really nice to have this functionality in sklearn.
Considering the problematic bias, maybe it would be better to have the “extended” algorithm replace the original, unless you can conceive a situation where you’d still want the original. The extended algorithm looks consistently better in the paper.