Isolation forest final stage very slow and single threaded

Description

Isolation forest final stage very slow and single threaded.

This is an issue I run into quite frequently. I’ll train an isolation forest on a fairly large data set (on the order of 1M to 100M records, around 50 features), and training runs quickly and in parallel with nearly 100% CPU utilization. I’ll get output like the following:

[Parallel(n_jobs=30)]: Using backend LokyBackend with 30 concurrent workers.
[Parallel(n_jobs=30)]: Done   3 out of  30 | elapsed:   17.9s remaining:  2.7min
[Parallel(n_jobs=30)]: Done   7 out of  30 | elapsed:   18.5s remaining:  1.0min
[Parallel(n_jobs=30)]: Done  11 out of  30 | elapsed:   19.4s remaining:   33.5s
[Parallel(n_jobs=30)]: Done  15 out of  30 | elapsed:   19.7s remaining:   19.7s
[Parallel(n_jobs=30)]: Done  19 out of  30 | elapsed:   20.0s remaining:   11.6s
[Parallel(n_jobs=30)]: Done  23 out of  30 | elapsed:   20.2s remaining:    6.2s
[Parallel(n_jobs=30)]: Done  27 out of  30 | elapsed:   20.9s remaining:    2.3s
[Parallel(n_jobs=30)]: Done  30 out of  30 | elapsed:   21.5s finished

And then it will run for a very long time (10x as long? more?) on a single core before it eventually finishes. Often the progress statements are all printed at once at the end, when the task completes:

Building estimator 1 of 3 for this parallel run (total 100)...
Building estimator 2 of 3 for this parallel run (total 100)...
Building estimator 3 of 3 for this parallel run (total 100)...
...

I presume that’s from parallel processes or threads printing to stdout without flushing.

I create the isolation forest with:

from sklearn.ensemble import IsolationForest

model_kwargs = {
    'n_estimators': 100,
    'n_jobs': 30,
    'verbose': 10,
    'max_samples': 1000,
    'behaviour': "new",
}
clf = IsolationForest(**model_kwargs)

Versions

System:
    python: 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34) [GCC 7.3.0]
    executable: /home/ibackus/anaconda3/bin/python
    machine: Linux-4.15.0-1032-aws-x86_64-with-debian-buster-sid

BLAS:
    macros: SCIPY_MKL_H=None, HAVE_CBLAS=None
    lib_dirs: /home/ibackus/anaconda3/lib
    cblas_libs: mkl_rt, pthread

Python deps:
    pip: 18.1
    setuptools: 40.6.3
    sklearn: 0.20.2
    numpy: 1.15.4
    scipy: 1.2.1
    Cython: 0.29.2
    pandas: 0.24.1

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 8 (5 by maintainers)

Top GitHub Comments

ngoix commented, Mar 1, 2019 (1 reaction)

You can achieve this by setting contamination="auto" and behaviour="new"
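
A minimal sketch of the suggestion above, not the maintainers' code. The synthetic array `X` and the reduced `n_jobs` are stand-ins chosen for illustration. In current scikit-learn releases, a numeric `contamination` makes `fit` score the full training set to place the threshold, whereas `contamination="auto"` uses the fixed offset -0.5; whether that scoring pass fully explains the slow stage in 0.20.2 is an assumption here. The `behaviour` argument was removed in later releases, so the sketch passes it only if the installed version still accepts it.

```python
import inspect

import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for the real data (the issue used 1M to 100M rows, ~50 features).
rng = np.random.RandomState(0)
X = rng.normal(size=(10000, 50))

model_kwargs = {
    "n_estimators": 100,
    "n_jobs": 2,              # the issue used 30; lowered so the sketch runs on small machines
    "max_samples": 1000,
    "contamination": "auto",  # fixed offset instead of a percentile of training scores
}

# `behaviour` exists only in older scikit-learn releases (such as 0.20.2 from
# the issue); pass it only when the installed version still accepts it.
if "behaviour" in inspect.signature(IsolationForest.__init__).parameters:
    model_kwargs["behaviour"] = "new"

clf = IsolationForest(**model_kwargs).fit(X)

# With contamination="auto" the decision offset is a constant rather than
# something estimated from the training data.
print(clf.offset_)
```

Note that `"auto"` also changes which points `predict` flags, because the threshold is no longer derived from an expected outlier fraction.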

jnothman commented, Mar 4, 2019 (0 reactions)

I’ll close this, given solutions presented, work merged since and work under way… let me know if that is a mistake.
