Isolation forest final stage very slow and single threaded
Description
This is an issue I hit quite frequently. I'll train an isolation forest on a decently large data set (say on the order of 1M to 100M records, around 50 features), and it will run rapidly and in parallel with nearly 100% CPU utilization. I'll get output like the following:
[Parallel(n_jobs=30)]: Using backend LokyBackend with 30 concurrent workers.
[Parallel(n_jobs=30)]: Done 3 out of 30 | elapsed: 17.9s remaining: 2.7min
[Parallel(n_jobs=30)]: Done 7 out of 30 | elapsed: 18.5s remaining: 1.0min
[Parallel(n_jobs=30)]: Done 11 out of 30 | elapsed: 19.4s remaining: 33.5s
[Parallel(n_jobs=30)]: Done 15 out of 30 | elapsed: 19.7s remaining: 19.7s
[Parallel(n_jobs=30)]: Done 19 out of 30 | elapsed: 20.0s remaining: 11.6s
[Parallel(n_jobs=30)]: Done 23 out of 30 | elapsed: 20.2s remaining: 6.2s
[Parallel(n_jobs=30)]: Done 27 out of 30 | elapsed: 20.9s remaining: 2.3s
[Parallel(n_jobs=30)]: Done 30 out of 30 | elapsed: 21.5s finished
And then it will run for a very long time (10x as long? more?) on a single core, and eventually finalize. Often I’ll get progress statements all printed simultaneously at the end when the task completes:
Building estimator 1 of 3 for this parallel run (total 100)...
Building estimator 2 of 3 for this parallel run (total 100)...
Building estimator 3 of 3 for this parallel run (total 100)...
...
I presume that’s from parallel processes or threads printing to stdout without flushing.
I create the isolation forest with:
from sklearn.ensemble import IsolationForest

model_kwargs = {
    'n_estimators': 100,
    'n_jobs': 30,
    'verbose': 10,
    'max_samples': 1000,
    'behaviour': "new",
}
clf = IsolationForest(**model_kwargs)
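For context, a minimal sketch of the kind of call that shows this pattern (synthetic data; the array size and n_jobs are illustrative, not the original data set or machine):

import numpy as np
from sklearn.ensemble import IsolationForest

# Illustrative reproduction sketch: synthetic data, sizes chosen only to be
# "decently large"; not the original workload.
rng = np.random.RandomState(0)
X = rng.normal(size=(1_000_000, 50))

clf = IsolationForest(n_estimators=100, n_jobs=30, verbose=10,
                      max_samples=1000, behaviour="new")
clf.fit(X)  # tree building runs in parallel; the long single-core tail follows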
Versions
System:
    python: 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34) [GCC 7.3.0]
    executable: /home/ibackus/anaconda3/bin/python
    machine: Linux-4.15.0-1032-aws-x86_64-with-debian-buster-sid

BLAS:
    macros: SCIPY_MKL_H=None, HAVE_CBLAS=None
    lib_dirs: /home/ibackus/anaconda3/lib
    cblas_libs: mkl_rt, pthread

Python deps:
    pip: 18.1
    setuptools: 40.6.3
    sklearn: 0.20.2
    numpy: 1.15.4
    scipy: 1.2.1
    Cython: 0.29.2
    pandas: 0.24.1
Top GitHub Comments
You can avoid the slow, single-threaded final stage by setting contamination="auto" and behaviour="new".
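A minimal sketch of that configuration, keeping the other parameters from the snippet above (the comments describe the 0.20.x behaviour as I understand it, so treat them as an assumption rather than a definitive statement):

from sklearn.ensemble import IsolationForest

# Sketch: same parameters as in the issue, plus contamination="auto".
# With a numeric contamination, fit() scores the whole training set
# (single-threaded) to compute the decision threshold; contamination="auto"
# uses a fixed offset instead and skips that pass.
# behaviour="new" applies to sklearn 0.20/0.21; later releases dropped the
# parameter and use the new behaviour by default.
clf = IsolationForest(
    n_estimators=100,
    n_jobs=30,
    verbose=10,
    max_samples=1000,
    contamination="auto",
    behaviour="new",
)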
I’ll close this, given the solutions presented, the work merged since, and the work under way… let me know if that is a mistake.