Isolation forest - decision_function & average_path_length methods are memory inefficient
Description
Isolation forest consumes too much memory due to a memory-inefficient implementation of the anomaly score calculation. Because of this, parallelization with n_jobs is also limited, as the anomaly score cannot be computed in parallel for each tree.
Steps/Code to Reproduce
Run a simple isolation forest with n_estimators set to 10 and to 50 respectively. Memory profiling shows that building each individual tree does not take much memory, but a lot of memory is consumed at the end, because a for loop iterates over all trees, computes the per-tree anomaly contributions all together, and only then averages them (iforest.py, lines 267-281):
for i, (tree, features) in enumerate(zip(self.estimators_,
                                         self.estimators_features_)):
    if subsample_features:
        X_subset = X[:, features]
    else:
        X_subset = X
    leaves_index = tree.apply(X_subset)
    node_indicator = tree.decision_path(X_subset)
    n_samples_leaf[:, i] = tree.tree_.n_node_samples[leaves_index]
    depths[:, i] = np.ravel(node_indicator.sum(axis=1))
    depths[:, i] -= 1

depths += _average_path_length(n_samples_leaf)

scores = 2 ** (-depths.mean(axis=1) / _average_path_length(self.max_samples_))

# Take the opposite of the scores as bigger is better (here less
# abnormal) and add 0.5 (this value plays a special role as described
# in the original paper) to give a sense to scores = 0:
return 0.5 - scores
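For reference, the quantity computed above is the anomaly score from the original isolation forest paper (Liu et al., 2008):

$$ s(x, \psi) = 2^{-\mathbb{E}[h(x)]/c(\psi)}, \qquad c(\psi) = 2H(\psi - 1) - \frac{2(\psi - 1)}{\psi}, $$

where $h(x)$ is the path length of sample $x$ in a tree (with a correction for the size of the leaf it ends in), $c(\psi)$ is the average path length of an unsuccessful search in a binary search tree built on $\psi$ points, and $H(i)$ is the $i$-th harmonic number; `_average_path_length` computes $c$. The problem is not the formula itself but that `depths` and `n_samples_leaf` are materialized with shape (n_samples, n_estimators) before it is applied.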
Because of this, with a larger number of estimators (e.g. 1000), the memory consumed is quite high.
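A minimal reproduction sketch along these lines (the exact dataset shape and the use of memory_profiler here are illustrative assumptions, not taken from the original report):

# Profile peak memory for increasing n_estimators; the dataset size
# below is illustrative (the reporter mentions ~257K samples x 35 features).
import numpy as np
from memory_profiler import memory_usage
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = rng.randn(100_000, 35)

def fit_and_score(n_estimators):
    model = IsolationForest(n_estimators=n_estimators, random_state=42)
    model.fit(X)  # fit() itself ends by scoring the training set
    return model.decision_function(X)

for n_estimators in (10, 50, 1000):
    peak = max(memory_usage((fit_and_score, (n_estimators,))))
    print(f"n_estimators={n_estimators}: peak memory ~{peak:.0f} MiB")

The per-tree fitting stays cheap; the spike appears only at scoring time, when the (n_samples, n_estimators) arrays are allocated.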
Expected Results
Possible solution: the above for loop should only average the anomaly scores coming from each estimator instead of computing them all at once. The logic of the isolation forest anomaly score calculation could be moved to the base estimator class so that it is done per tree (I guess in the bagging.py file, similar to the other methods available after fitting).
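A rough sketch of what per-tree accumulation could look like (the helper names `_tree_depths` and `_score_samples` are hypothetical, and `_average_path_length` is assumed to be sklearn's existing private helper; this only illustrates keeping (n_samples,)-shaped arrays instead of (n_samples, n_estimators) ones):

import numpy as np

try:  # import path differs across scikit-learn versions
    from sklearn.ensemble._iforest import _average_path_length
except ImportError:
    from sklearn.ensemble.iforest import _average_path_length

def _tree_depths(tree, X_subset):
    """Depth of each sample in one fitted tree, corrected for the size of
    the leaf it falls into, as in the isolation forest paper."""
    leaves_index = tree.apply(X_subset)
    node_indicator = tree.decision_path(X_subset)
    n_samples_leaf = tree.tree_.n_node_samples[leaves_index]
    depths = np.ravel(node_indicator.sum(axis=1)) - 1.0
    return depths + _average_path_length(n_samples_leaf)

def _score_samples(estimators, estimators_features, X, max_samples,
                   subsample_features):
    # Accumulate a single (n_samples,) array instead of materializing
    # (n_samples, n_estimators) matrices for depths and leaf sizes.
    depths_sum = np.zeros(X.shape[0], dtype=np.float64)
    for tree, features in zip(estimators, estimators_features):
        X_subset = X[:, features] if subsample_features else X
        depths_sum += _tree_depths(tree, X_subset)
    mean_depths = depths_sum / len(estimators)
    scores = 2.0 ** (-mean_depths / _average_path_length(np.array([max_samples])))
    return 0.5 - scores

Each tree's contribution is reduced to a single (n_samples,) vector before moving on, so peak memory no longer scales with n_estimators; this is also a shape that could naturally be computed per tree inside the bagging machinery and averaged afterwards.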
Actual Results
The memory consumption grows sharply as the number of estimators increases.
model = IsolationForest()
model.fit(data)
The fit method calls decision_function and the anomaly score averaging, which take quite a lot of memory.
The memory spike is highest at the very end, that is, in the final call to the _average_path_length() method:

depths += _average_path_length(n_samples_leaf)
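A rough back-of-the-envelope estimate (assuming float64 arrays and the ~257K x 35 dataset with 1000 estimators mentioned elsewhere in this issue) shows why that line is where the spike appears: `_average_path_length(n_samples_leaf)` returns a temporary of the same (n_samples, n_estimators) shape, on top of `depths` and `n_samples_leaf`, which are already in memory.

# Illustrative estimate, not a measurement: around the line above, three
# dense float64 arrays of shape (n_samples, n_estimators) coexist:
# depths, n_samples_leaf, and the temporary returned by
# _average_path_length(n_samples_leaf).
n_samples, n_estimators = 257_000, 1_000
bytes_per_array = n_samples * n_estimators * 8        # float64
print(bytes_per_array / 2**30)      # ~1.9 GiB per array
print(3 * bytes_per_array / 2**30)  # ~5.7 GiB held at once, roughly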
Versions
Top GitHub Comments
Hello, yes that's exactly the issue with isolation forest. The dataset is indeed large: 257K samples with 35 numerical features. However, it needs to be even larger than that for my needs, so I'm looking for memory efficiency in addition to speed.
I have gone through the links and they are quite useful for my specific use cases (I was even facing memory issues with the silhouette score and the brute-force algorithm). I'm also exploring the dask package, which works on chunks using dask arrays/dataframes, to see whether it can be used as an alternative in the places where sklearn is consuming too much memory.
I will first work on handling the data in chunks, and probably in the coming weeks I will make the PR for the isolation forest modification, as I also have to go through the research paper on the algorithm. I'm also looking at how other packages/languages implement isolation forests. The bagging implementation here seems quite different, i.e. I think a tree is built for each sample instead of simply building n_estimators trees and then applying each of them to every sample. In any case, I have to understand a few other things before starting work/discussion on this in detail.
Working on this for the sprint. So to avoid arrays of shape (n_samples, n_estimators) in memory, we can either:
1) aggregate the depths tree by tree, keeping only arrays of shape (n_samples,), or
2) compute the scores on chunks of samples.
We can also do both options I guess. I'm not sure if 1) can be done easily though, looking into it.
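A minimal sketch of the chunking idea in 2), assuming a `score_chunk` callable that wraps the existing per-batch score computation (the helper name and chunk size are illustrative):

import numpy as np

def decision_function_chunked(score_chunk, X, chunk_size=10_000):
    """Compute anomaly scores on slices of X so that only
    (chunk_size, n_estimators) intermediates exist at any time.

    score_chunk : callable mapping an array of shape (m, n_features)
        to scores of shape (m,); it stands in for the existing
        per-batch score computation.
    """
    n_samples = X.shape[0]
    scores = np.empty(n_samples, dtype=np.float64)
    for start in range(0, n_samples, chunk_size):
        stop = min(start + chunk_size, n_samples)
        scores[start:stop] = score_chunk(X[start:stop])
    return scores

With this, the intermediate arrays are bounded by chunk_size * n_estimators regardless of n_samples, and it combines naturally with option 1).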