IsolationForest extremely slow with large number of columns having discrete values
The following example takes an unreasonable amount of time to run:
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.datasets import fetch_rcv1
from scipy.sparse import csc_matrix, csr_matrix

# RCV1 is large and very sparse: ~804k rows and ~47k mostly-zero features.
X, y = fetch_rcv1(return_X_y=True)
X = csc_matrix(X)
X.sort_indices()
iso = IsolationForest(n_estimators=100, max_samples=256).fit(X)
In theory this should be very fast: each sub-sample of 256 rows is a small sparse matrix in which most columns contain only zeros.
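As a rough sanity check of that claim, here is a hedged sketch (not part of the original report) that counts how many columns of a random 256-row sub-sample contain any non-zero value at all; on RCV1 the vast majority should be all-zero and therefore unsplittable:

import numpy as np
from scipy.sparse import csc_matrix
from sklearn.datasets import fetch_rcv1

X, _ = fetch_rcv1(return_X_y=True)
rng = np.random.default_rng(0)
rows = rng.choice(X.shape[0], size=256, replace=False)
sub = csc_matrix(X[rows, :])

# Columns with at least one non-zero entry in the sub-sample; only these
# can ever produce a valid split.
nonzero_cols = np.count_nonzero(np.diff(sub.indptr))
print(f"{nonzero_cols} of {sub.shape[1]} columns are splittable")

If only a small fraction of the ~47k columns are splittable, an efficient implementation should spend time proportional to that, not to the full feature count.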
Using n_jobs > 1 also makes it use an unreasonably large amount of memory for some reason.
Even if the input is passed as dense, the running time still looks worse than it should be. From a quick glance at the code, I suspect the issue is that the tree builder does not remember which columns can no longer be split (i.e. are constant) within a given node:
# Densify a 1000-row slice; even this small dense input is slow to fit.
X_sample = csr_matrix(X)[:1000, :]
X_sample = X_sample.toarray()
iso = IsolationForest(n_estimators=100, max_samples=256).fit(X_sample)
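To make the suspected fix concrete, here is a minimal, hypothetical sketch (my own toy construction, not scikit-learn's actual tree code) of a random-split builder that passes the set of still-splittable columns down each branch, so a column found constant in a node is never re-examined anywhere in its subtree:

import numpy as np

def grow(X, rows, candidate_cols, depth, rng, max_depth=8):
    """Toy isolation-tree builder for a dense array X.

    candidate_cols holds the columns still worth splitting; a column that
    is constant within a node is dropped for the whole subtree below it
    instead of being re-checked at every level.
    """
    if depth >= max_depth or len(rows) <= 1 or candidate_cols.size == 0:
        return len(rows)  # leaf: record its size
    sub = X[rows]
    # Drop columns that became constant here; they stay dropped below.
    cols = sub[:, candidate_cols]
    keep = candidate_cols[cols.min(axis=0) < cols.max(axis=0)]
    if keep.size == 0:
        return len(rows)  # every remaining column is constant: a leaf
    col = rng.choice(keep)
    threshold = rng.uniform(sub[:, col].min(), sub[:, col].max())
    go_left = sub[:, col] < threshold
    return (grow(X, rows[go_left], keep, depth + 1, rng),
            grow(X, rows[~go_left], keep, depth + 1, rng))

For example, grow(X_sample, np.arange(X_sample.shape[0]), np.arange(X_sample.shape[1]), 0, np.random.default_rng(0)) builds one such toy tree on the dense sample above, touching each constant column at most once per path.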
Top GitHub Comments
Here is an edited version of your reproducer to get more precise measurements:
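(The edited reproducer itself was not captured in this copy of the thread. Below is a hedged reconstruction of what such an instrumented script might look like, using memory_usage from the memory_profiler package; this is my assumption, not necessarily the code that was actually posted.)

import numpy as np
from memory_profiler import memory_usage
from scipy.sparse import csc_matrix
from sklearn.datasets import fetch_rcv1
from sklearn.ensemble import IsolationForest

X, _ = fetch_rcv1(return_X_y=True)
X = csc_matrix(X)
X.sort_indices()

def fit(n_jobs):
    IsolationForest(n_estimators=100, max_samples=256,
                    n_jobs=n_jobs).fit(X)

for n_jobs in (1, 4):
    # Sample resident memory every 0.1 s while the fit runs.
    mem = memory_usage((fit, (n_jobs,)), interval=0.1)
    print(f"n_jobs={n_jobs}: peak RSS {max(mem):.0f} MiB")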
I can then use memory_profiler to monitor memory usage over time. Here is what I observe:
[memory_profiler plots: memory usage over time for n_jobs = 1 and n_jobs = 4]
So a peak memory usage of 5 GB, which indeed seems quite large but is maybe expected: the resampling performed by the bagging machinery for each model fitted in parallel causes the dataset to be replicated once per concurrently built tree. As the original data is already 731.2 MB, 5 GB in total does not sound too catastrophic to me: 1 × 731 MB (original data) + 4 × 731 MB (concurrent resampled copies) ≈ 3.6 GB, and if the Python GC lags a bit it is plausible to see a few extra GB on top of that.
Also, I am not sure what is going on when you row-wise resample a SciPy CSC matrix, but I suspect intermediate data structures are being materialized.
As for execution time, I don’t really know whether ~25 s is abnormally slow for fitting 100 isolation trees on this sparse data.
That sounds like an interesting optimization avenue, indeed. A PR would be appreciated, I think 😃
I think one of the reasons is that `DecisionTree` has a `check_input` argument which is set to True by default. `RandomForest` is derived from `BaseForest`, which uses `_parallel_build_trees` to construct the underlying estimators and sets `check_input` to False in the process; meanwhile `IsolationForest` is derived from `BaseBagging`, which uses `_parallel_build_estimators` and does not set `check_input` to False, so X gets re-validated once per estimator.
The slicing operation `X[:, features]` is also pretty slow: fitting 10 estimators with n_jobs=1 takes 10 s, adding `check_input=False` brings that down to around 6 s, and removing the slicing operation entirely (which gives incorrect results) takes under 1 s.
Maybe we should change `IsolationForest` to derive from `BaseForest`?
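A hedged way to gauge the cost of that repeated validation (my own sketch, not the commenter's script; I am assuming the tree's fit performs roughly this check_array call, with dtype=float32 and accept_sparse="csc", when check_input=True):

from time import perf_counter

import numpy as np
from scipy.sparse import csc_matrix
from sklearn.datasets import fetch_rcv1
from sklearn.utils import check_array

X, _ = fetch_rcv1(return_X_y=True)
X = csc_matrix(X)
X.sort_indices()

# Each estimator re-validates X when check_input=True; time 10 rounds of
# the conversion/validation that each tree fit would repeat.
tic = perf_counter()
for _ in range(10):
    check_array(X, accept_sparse="csc", dtype=np.float32)
print(f"10 redundant validations: {perf_counter() - tic:.2f} s")

On a matrix of this size each validation pass copies and converts ~731 MB of data, so skipping it once per estimator plausibly accounts for a large share of the 10 s → 6 s gap quoted above.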