Isolation forest final stage very slow and single threaded

Description

I run into this issue quite frequently. I train an isolation forest on a fairly large data set (on the order of 1M to 100M records, around 50 features), and the first stage runs quickly and in parallel at nearly 100% CPU utilization. I get output like the following:

[Parallel(n_jobs=30)]: Using backend LokyBackend with 30 concurrent workers.
[Parallel(n_jobs=30)]: Done   3 out of  30 | elapsed:   17.9s remaining:  2.7min
[Parallel(n_jobs=30)]: Done   7 out of  30 | elapsed:   18.5s remaining:  1.0min
[Parallel(n_jobs=30)]: Done  11 out of  30 | elapsed:   19.4s remaining:   33.5s
[Parallel(n_jobs=30)]: Done  15 out of  30 | elapsed:   19.7s remaining:   19.7s
[Parallel(n_jobs=30)]: Done  19 out of  30 | elapsed:   20.0s remaining:   11.6s
[Parallel(n_jobs=30)]: Done  23 out of  30 | elapsed:   20.2s remaining:    6.2s
[Parallel(n_jobs=30)]: Done  27 out of  30 | elapsed:   20.9s remaining:    2.3s
[Parallel(n_jobs=30)]: Done  30 out of  30 | elapsed:   21.5s finished

It then runs for a very long time (10x as long, possibly more) on a single core before it eventually finishes. Often the progress statements are all printed at once at the end, when the task completes:

Building estimator 1 of 3 for this parallel run (total 100)...
Building estimator 2 of 3 for this parallel run (total 100)...
Building estimator 3 of 3 for this parallel run (total 100)...
...

I presume that’s from parallel processes or threads printing to stdout without flushing.
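The buffering effect guessed at above can be illustrated with a minimal sketch. It is not how scikit-learn or joblib emit their messages; `report_progress` is a hypothetical helper, and the only point is that `print(..., flush=True)` pushes each line out immediately, whereas output written to a pipe or file is otherwise block-buffered and may appear all at once when the process exits.

```python
import sys


def report_progress(index, total, stream=None):
    """Write one progress line and flush it right away (hypothetical helper)."""
    stream = sys.stdout if stream is None else stream
    # flush=True forces the line out of the stream buffer immediately,
    # so it is visible even when the stream is a pipe or a file.
    print(f"Building estimator {index} of {total}...", file=stream, flush=True)


if __name__ == "__main__":
    for i in range(1, 4):
        report_progress(i, 3)
```

Running the interpreter with `python -u` or setting `PYTHONUNBUFFERED=1` has a similar effect for a whole process without changing any code.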

I create the isolation forest with:

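from sklearn.ensemble import IsolationForest

model_kwargs = {
    'n_estimators': 100,   # number of trees in the forest
    'n_jobs': 30,          # number of parallel workers
    'verbose': 10,         # print progress messages while building trees
    'max_samples': 1000,   # rows subsampled to build each tree
    'behaviour': 'new',    # opt in to the new decision_function behaviour (0.20.x)
}
clf = IsolationForest(**model_kwargs)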

Versions

System:
    python: 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34) [GCC 7.3.0]
    executable: /home/ibackus/anaconda3/bin/python
    machine: Linux-4.15.0-1032-aws-x86_64-with-debian-buster-sid

BLAS:
    macros: SCIPY_MKL_H=None, HAVE_CBLAS=None
    lib_dirs: /home/ibackus/anaconda3/lib
    cblas_libs: mkl_rt, pthread

Python deps:
    pip: 18.1
    setuptools: 40.6.3
    sklearn: 0.20.2
    numpy: 1.15.4
    scipy: 1.2.1
    Cython: 0.29.2
    pandas: 0.24.1

Issue Analytics

Created 5 years ago
Comments: 8 (5 by maintainers)

Top GitHub Comments

ngoix commented, Mar 1, 2019

You can achieve this by setting contamination="auto" and behaviour="new"
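A minimal sketch of that suggestion follows. What "this" refers to is not included in this excerpt; the assumption here is that it means skipping the pass in which `fit` scores the training data to derive a threshold from a numeric `contamination`. The data, `random_state`, and the smaller `n_jobs` and `max_samples` values are illustrative only. The `behaviour` argument was removed in later scikit-learn releases, so the sketch passes it only when the installed version still accepts it.

```python
import inspect

import numpy as np
from sklearn.ensemble import IsolationForest

model_kwargs = {
    "n_estimators": 100,
    "n_jobs": 2,               # illustrative; the issue used 30
    "max_samples": 256,        # illustrative; the issue used 1000
    "contamination": "auto",   # fixed offset instead of a data-derived threshold
    "random_state": 0,
}

# `behaviour` only exists in older releases; pass it only if it is accepted.
if "behaviour" in inspect.signature(IsolationForest.__init__).parameters:
    model_kwargs["behaviour"] = "new"

# Synthetic stand-in for the real data set (assumption, not from the issue).
rng = np.random.RandomState(0)
X = rng.normal(size=(2000, 5))

clf = IsolationForest(**model_kwargs).fit(X)
print(clf.offset_)  # with contamination="auto" the offset is the constant -0.5
```

With a numeric `contamination`, the offset is instead computed from the scores of the training samples, which is extra work on the full data set after the trees are built.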

jnothman commented, Mar 4, 2019

I’ll close this, given solutions presented, work merged since and work under way… let me know if that is a mistake.
