Add option to compute permutation_importance on a random subset of rows
For very large data sets, computing the feature importance on a random subset of rows instead of all rows can speed up the calculation dramatically. Since permutation importance runs several iterations per variable, it makes sense to draw a new subset each time to cover a larger part of the data.
The change would require sklearn.inspection.permutation_importance to have an additional argument specifying a maximum number of rows. This functionality is available in other packages, for example, in R: https://rdrr.io/cran/ingredients/man/feature_importance.html
Proposed solution
To be more precise, I'd modify `_calculate_permutation_scores` like this:
```python
import numpy as np


def _calculate_permutation_scores(self, estimator, X, y, sample_weight, col_idx,
                                  random_state, n_repeats, scorer, max_rows):
    X_permuted = X.copy()
    y_mod = y.copy()
    n_rows = X.shape[0]
    if max_rows != -1 and n_rows > max_rows:
        # Draw a row subset without replacement, reusing the passed-in
        # random_state so results stay reproducible.
        sample_rows = random_state.choice(n_rows, max_rows, replace=False)
        if hasattr(X_permuted, "iloc"):  # pandas input
            X_permuted = X_permuted.iloc[sample_rows]
            y_mod = y_mod.iloc[sample_rows]
        else:  # numpy input
            X_permuted = X_permuted[sample_rows]
            y_mod = y_mod[sample_rows]
        if sample_weight is not None:
            # Keep the weights aligned with the subsampled rows.
            sample_weight = np.asarray(sample_weight)[sample_rows]
    scores = np.zeros(n_repeats)
    shuffling_idx = np.arange(X_permuted.shape[0])
    for n_round in range(n_repeats):
        random_state.shuffle(shuffling_idx)
        if hasattr(X_permuted, "iloc"):
            col = X_permuted.iloc[shuffling_idx, col_idx]
            col.index = X_permuted.index
            X_permuted.iloc[:, col_idx] = col
        else:
            X_permuted[:, col_idx] = X_permuted[shuffling_idx, col_idx]
        feature_score = _weights_scorer(
            scorer, estimator, X_permuted, y_mod, sample_weight
        )
        scores[n_round] = feature_score
    return scores
```
I am happy to make a PR but wanted to hear the core developers' opinion first.
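To illustrate the intended behavior end to end, here is a minimal standalone sketch of the idea (not scikit-learn's internal code): each repeat permutes one column and scores it on a fresh random subset of rows, as the issue text suggests. The model, data, and subset size are arbitrary choices for illustration.

```python
# Standalone sketch of the proposal (not scikit-learn internals): permutation
# importance where each repeat is scored on a fresh random subset of rows.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

X, y = make_regression(n_samples=50_000, n_features=5, random_state=0)
model = Ridge().fit(X, y)
rng = np.random.RandomState(0)
max_rows, n_repeats = 5_000, 10

baseline = r2_score(y, model.predict(X))
for col_idx in range(X.shape[1]):
    scores = np.zeros(n_repeats)
    for n_round in range(n_repeats):
        # New subset per repeat, so the repeats together cover more of the data.
        rows = rng.choice(X.shape[0], max_rows, replace=False)
        X_sub, y_sub = X[rows].copy(), y[rows]
        X_sub[:, col_idx] = rng.permutation(X_sub[:, col_idx])
        scores[n_round] = r2_score(y_sub, model.predict(X_sub))
    print(f"feature {col_idx}: importance {baseline - scores.mean():.4f}")
```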
Top GitHub Comments
Subsampling the data before sending it to the function means calculating all statistics on this subset only, likely introducing quite a bias. Subsampling within an iteration is cleaner imho.
An alternative is calling the function with n_repeats equal to 1 several times, manually aggregating the results. But then the parallelisation by variable becomes quite inefficient.
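For reference, that workaround is possible with the public API today. A rough sketch (the toy model and the subset size of 5,000 are illustrative assumptions) could look like this:

```python
# Workaround sketch: call permutation_importance with n_repeats=1 on a fresh
# row subset each time and aggregate the single-repeat importances manually.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=50_000, n_features=5, random_state=0)
model = Ridge().fit(X, y)
rng = np.random.RandomState(0)
max_rows, n_repeats = 5_000, 10

repeats = []
for seed in range(n_repeats):
    rows = rng.choice(X.shape[0], max_rows, replace=False)
    result = permutation_importance(
        model, X[rows], y[rows], n_repeats=1, random_state=seed
    )
    repeats.append(result.importances[:, 0])  # one column per repeat

importances_mean = np.mean(repeats, axis=0)
importances_std = np.std(repeats, axis=0)
print(importances_mean)
```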
If I’m not mistaken this issue can be closed as solved by #20431. @knoam feel free to open a new one with your proposal. Thanks!
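For readers landing here later: scikit-learn 1.0+ exposes a `max_samples` parameter on `permutation_importance` that draws a random row subset (without replacement) for the permutation scoring, which appears to be the feature the referenced PR added:

```python
# Current scikit-learn API (1.0+): max_samples restricts permutation scoring
# to a random subset of rows, addressing the request in this issue.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

X, y = make_regression(n_samples=100_000, n_features=10, random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

result = permutation_importance(
    model, X, y,
    n_repeats=5,
    random_state=0,
    max_samples=10_000,  # may also be a float fraction, e.g. 0.1
)
print(result.importances_mean)
```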