
IsolationForest extremely slow with large number of columns having discrete values

See original GitHub issue

The following example takes an unreasonable amount of time to run:

import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.datasets import fetch_rcv1
from scipy.sparse import csc_matrix, csr_matrix

# RCV1 is ~800k rows x ~47k columns and extremely sparse
X, y = fetch_rcv1(return_X_y=True)
X = csc_matrix(X)  # column-oriented storage, as the tree code prefers for fit
X.sort_indices()
iso = IsolationForest(n_estimators=100, max_samples=256).fit(X)

In theory it should be very fast, since each sub-sample it takes is a small sparse matrix in which most columns will have only zeros.
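As a quick illustration (a sketch, not from the original report), one can count how many columns of a single 256-row subsample contain any nonzero value at all; on rcv1 the vast majority turn out to be entirely empty:

import numpy as np
from sklearn.datasets import fetch_rcv1
from scipy.sparse import csc_matrix

# rcv1 comes back as CSR, shape (804414, 47236)
X, _ = fetch_rcv1(return_X_y=True)
rng = np.random.default_rng(0)
rows = rng.choice(X.shape[0], size=256, replace=False)
sub = csc_matrix(X[rows])  # column-oriented 256-row subsample

# nnz per column is the difference of consecutive indptr entries;
# a column is "empty" when that difference is zero
nonzero_cols = int(np.count_nonzero(np.diff(sub.indptr)))
print(f"{nonzero_cols} of {sub.shape[1]} columns have any nonzero entry")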

Using n_jobs>1 also makes it use a very unreasonable amount of memory for some reason.

If the input is passed as dense, the running time still looks worse than it should. From a quick glance at the code, I guess the issue is that it doesn't remember which columns can no longer be split in a given node:

X_sample = csr_matrix(X)[:1000, :]  # keep only the first 1000 rows
X_sample = X_sample.toarray()       # densify
iso = IsolationForest(n_estimators=100, max_samples=256).fit(X_sample)

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 10 (8 by maintainers)

Top GitHub Comments

2 reactions
ogrisel commented, Jan 27, 2021

Here is an edited version of your reproducer to get more precise measurements:

import sys
import os
import pickle
import numpy as np
from time import perf_counter
from sklearn.ensemble import IsolationForest
from sklearn.datasets import fetch_rcv1
from scipy.sparse import csc_matrix, csr_matrix

X, y = fetch_rcv1(return_X_y=True)
X = csc_matrix(X)
X.sort_indices()

data_size = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
print(f"data size: {data_size / 1e6:.1f} MB")

n_jobs = int(sys.argv[1])
print(f"Running IsolationForest with n_jobs={n_jobs}...")
tic = perf_counter()
iso = IsolationForest(n_estimators=100, max_samples=256, n_jobs=n_jobs).fit(X)
print(f"duration: {perf_counter() - tic:.1f} s")


fname = "/tmp/tmp_model.pkl"
with open(fname, "wb") as f:
    pickle.dump(iso, f)

model_size = os.stat(fname).st_size
print(f"final model size: {model_size / 1e6:.1f} MB")

I can then use memory_profiler to monitor memory usage over time. Here is what I observe:

n_jobs = 1

(dev) ogrisel@arm64-apple-darwin20 /tmp % mprof run iso.py 1
mprof: Sampling memory every 0.1s
running new process
data size: 731.2 MB
Running IsolationForest with n_jobs=1...
duration: 24.1 s
final model size: 38.0 MB
(dev) ogrisel@arm64-apple-darwin20 /tmp % mprof plot

[memory profile plot: isoforest_rcv1_csc_n_jobs_1]

n_jobs = 4


(dev) ogrisel@arm64-apple-darwin20 /tmp % mprof run iso.py 4
mprof: Sampling memory every 0.1s
running new process
data size: 731.2 MB
Running IsolationForest with n_jobs=4...
duration: 18.3 s
final model size: 38.0 MB
(dev) ogrisel@arm64-apple-darwin20 /tmp % mprof plot 

[memory profile plot: isoforest_rcv1_csc_n_jobs_4]

So we see a peak memory usage of 5 GB, which indeed seems quite large but is maybe expected: the resampling from the RF-style bagging performed by each model in parallel causes the dataset to be replicated once per concurrently built tree. As the original data is already 731.2 MB, 5 GB in total memory usage does not sound too catastrophic to me: 1 × 731 MB (original data) + 4 × 731 MB (concurrent resampled copies) ≈ 3.7 GB, and assuming the Python GC lags a bit, that accounts for the extra GB or so.

Also, I am not sure what is going on when you row-wise resample a scipy CSC matrix, but I suspect there might be intermediate data structures being materialized.
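A rough timing sketch (on synthetic data of a similar shape, so the numbers are only indicative) can show the gap between row-wise fancy indexing on CSR and on CSC:

import numpy as np
from time import perf_counter
from scipy.sparse import random as sparse_random

# made-up shape and density, roughly mimicking rcv1
X_csr = sparse_random(200_000, 47_000, density=0.001, format="csr",
                      random_state=0)
X_csc = X_csr.tocsc()
rows = np.random.default_rng(0).choice(200_000, size=50_000, replace=False)

for name, mat in [("CSR", X_csr), ("CSC", X_csc)]:
    tic = perf_counter()
    _ = mat[rows]  # row-wise resampling, as in bagging
    print(f"{name}: {perf_counter() - tic:.3f} s")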

As for the execution time, I don't really know whether ~25 s is abnormally slow for fitting 100 isolation trees on this sparse data.

I guess the issue from a quick glance at the code is that it doesn’t remember which columns are already not possible to split in a given node:

That sounds like an interesting optimization avenue, indeed. A PR would be appreciated, I think 😃
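For reference, here is a minimal sketch of that idea on dense input (hypothetical code, not taken from the scikit-learn tree implementation): columns that have become constant in a node are filtered out once, so the children never draw them again:

import numpy as np

def grow(X, rng, depth=0, max_depth=8):
    """Grow one isolation (sub)tree on a dense array and return its depth."""
    if depth >= max_depth or X.shape[0] <= 1:
        return depth
    lo, hi = X.min(axis=0), X.max(axis=0)
    splittable = np.flatnonzero(hi > lo)      # columns that can still split
    if splittable.size == 0:
        return depth
    j = rng.choice(splittable)                # random feature ...
    t = rng.uniform(lo[j], hi[j])             # ... and random threshold
    left = X[:, j] < t
    X = X[:, splittable]                      # children never see dead columns
    return max(grow(X[left], rng, depth + 1, max_depth),
               grow(X[~left], rng, depth + 1, max_depth))

rng = np.random.default_rng(0)
sample = rng.poisson(0.01, size=(256, 5000)).astype(float)  # mostly zeros
print("max leaf depth:", grow(sample, rng))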

1 reaction
MaxwellLZH commented, Apr 1, 2021

I think one of the reasons is that DecisionTree's fit has a check_input argument which defaults to True. RandomForest is derived from BaseForest, which uses _parallel_build_trees to construct the underlying estimators, and it sets check_input to False in the process:

# sklearn/ensemble/_forest.py
def _parallel_build_trees(tree, forest, X, y, sample_weight, tree_idx, n_trees,
                          verbose=0, class_weight=None,
                          n_samples_bootstrap=None):
    """Private function used to fit a single tree in parallel."""
    # ...
        tree.fit(X, y, sample_weight=curr_sample_weight, check_input=False)
    else:
        tree.fit(X, y, sample_weight=sample_weight, check_input=False)
    return tree

Meanwhile, IsolationForest is derived from BaseBagging, which uses _parallel_build_estimators, and that function does not set check_input to False, so X is re-validated for every single estimator:

# sklearn/ensemble/_bagging.py
def _parallel_build_estimators(n_estimators, ensemble, X, y, sample_weight,
                               seeds, total_n_estimators, verbose):
    """Private function used to build a batch of estimators within a job."""
    # ...
    for i in range(n_estimators):
        random_state = seeds[i]
        estimator = ensemble._make_estimator(append=False,
                                             random_state=random_state)

        # Draw random feature, sample indices
        features, indices = _generate_bagging_indices(random_state,
                                                      bootstrap_features,
                                                      bootstrap, n_features,
                                                      n_samples, max_features,
                                                      max_samples)
        # ...
            # NB: check_input is left at its default (True) in both calls
            estimator.fit(X[:, features], y, sample_weight=curr_sample_weight)
        else:
            estimator.fit((X[indices])[:, features], y[indices])

        estimators.append(estimator)
        estimators_features.append(features)

    return estimators, estimators_features

The slicing operation X[:, features] is also pretty slow: fitting 10 estimators with n_jobs=1 takes about 10 s, adding check_input=False brings that down to around 6 s, and removing the slicing operation altogether (which is incorrect) takes under 1 s.
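Those two costs can be probed in isolation with a sketch along these lines (synthetic data, with ExtraTreeRegressor configured roughly the way IsolationForest configures its trees, so the exact timings will differ):

import numpy as np
from time import perf_counter
from scipy.sparse import random as sparse_random
from sklearn.tree import ExtraTreeRegressor

# made-up sparse data; float32 CSC with sorted indices, as the tree
# code expects when validation is skipped
X = sparse_random(100_000, 47_000, density=0.001, format="csc",
                  random_state=0).astype(np.float32)
X.sort_indices()
rng = np.random.default_rng(0)
y = rng.uniform(size=X.shape[0])   # dummy target, similar to what
                                   # IsolationForest uses internally

tree = ExtraTreeRegressor(max_features=1, max_depth=8, random_state=0)

tic = perf_counter()
tree.fit(X, y)                     # check_input=True (the default)
print(f"fit with validation:    {perf_counter() - tic:.2f} s")

tic = perf_counter()
tree.fit(X, y, check_input=False)  # what _parallel_build_trees does
print(f"fit without validation: {perf_counter() - tic:.2f} s")

features = np.arange(X.shape[1])   # max_features=1.0 -> all columns
tic = perf_counter()
_ = X[:, features]                 # the slice done for every estimator
print(f"X[:, features] slice:   {perf_counter() - tic:.2f} s")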

Maybe we should change IsolationForest to derive from BaseForest?

Read more comments on GitHub >
