
Add option to compute permutation_importance on a random subset of rows

See original GitHub issue

For very large data sets, computing the feature importance on a random subset of rows instead of all rows can speed up the calculation dramatically. Since permutation importance runs several iterations per variable, it makes sense to draw a new subset each time to cover a larger part of the data.

The change would require sklearn.inspection.permutation_importance to have an additional argument specifying a maximum number of rows. This functionality is available in other packages, for example, in R: https://rdrr.io/cran/ingredients/man/feature_importance.html
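
To illustrate the intent, a call with the suggested argument could look like this (max_rows is the proposed parameter and does not exist in scikit-learn today; the model and data are placeholders):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.RandomState(0)
X = rng.normal(size=(100_000, 10))  # stand-in for a very large data set
y = (X[:, 0] > 0).astype(int)
model = RandomForestClassifier(n_estimators=10, n_jobs=-1).fit(X, y)

# Proposed API, not current scikit-learn: cap each scoring pass at 10,000 rows.
result = permutation_importance(model, X, y, n_repeats=5,
                                random_state=0, max_rows=10_000)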

Proposed solution

To be more precise, I’d modify _calculate_permutation_scores like this:

import numpy as np

# _weights_scorer is the private scoring helper already used by
# sklearn.inspection._permutation_importance.
from sklearn.inspection._permutation_importance import _weights_scorer


def _calculate_permutation_scores(estimator, X, y, sample_weight, col_idx,
                                  random_state, n_repeats, scorer, max_rows):
    X_permuted = X.copy()
    y_mod = y.copy()
    n_rows = X.shape[0]
    if max_rows != -1 and n_rows > max_rows:
        # Draw the subsample without replacement, via the passed-in
        # random_state so results stay reproducible.
        sample_rows = random_state.choice(n_rows, max_rows, replace=False)
        if hasattr(X_permuted, "iloc"):  # pandas input
            X_permuted = X_permuted.iloc[sample_rows]
        else:  # numpy input
            X_permuted = X_permuted[sample_rows]
        y_mod = y_mod.iloc[sample_rows] if hasattr(y_mod, "iloc") else y_mod[sample_rows]
        if sample_weight is not None:
            sample_weight = sample_weight[sample_rows]
    scores = np.zeros(n_repeats)
    shuffling_idx = np.arange(X_permuted.shape[0])
    for n_round in range(n_repeats):
        # Shuffle only the target column, score, and record the result.
        random_state.shuffle(shuffling_idx)
        if hasattr(X_permuted, "iloc"):
            col = X_permuted.iloc[shuffling_idx, col_idx]
            col.index = X_permuted.index
            X_permuted.iloc[:, col_idx] = col
        else:
            X_permuted[:, col_idx] = X_permuted[shuffling_idx, col_idx]
        feature_score = _weights_scorer(
            scorer, estimator, X_permuted, y_mod, sample_weight
        )
        scores[n_round] = feature_score

    return scores
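
A quick sanity check of the modified helper on synthetic data (assumes the definition above; get_scorer and check_random_state are public scikit-learn utilities, and the model and data are placeholders):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import get_scorer
from sklearn.utils import check_random_state

X_big = np.random.RandomState(0).normal(size=(100_000, 5))
y_big = (X_big[:, 0] > 0).astype(int)
est = LogisticRegression().fit(X_big, y_big)

scores = _calculate_permutation_scores(
    est, X_big, y_big, sample_weight=None, col_idx=0,
    random_state=check_random_state(0), n_repeats=5,
    scorer=get_scorer("accuracy"), max_rows=5_000,
)
print(scores)  # accuracy per shuffle of column 0, on a 5,000-row subsample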

I am happy to make a PR, but wanted to get the opinion of the core developers first.

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 9 (8 by maintainers)

Top GitHub Comments

2 reactions
o1iv3r commented, Jun 14, 2021

Subsampling the data before sending it to the function means calculating all statistics on this subset only, likely introducing quite a bias. Subsampling within an iteration is cleaner, imho.

An alternative is calling the function with n_repeats equal to 1 several times and aggregating the results manually, roughly as sketched below. But then the parallelisation over variables becomes quite inefficient.
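
A rough sketch of that workaround, assuming numpy inputs and reusing the est, X_big, y_big placeholders from the sketch above:

import numpy as np
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n_repeats, max_rows = 10, 5_000
per_repeat = []
for _ in range(n_repeats):
    # Fresh subsample per repeat, then a single permutation round on it.
    rows = rng.choice(X_big.shape[0], size=max_rows, replace=False)
    r = permutation_importance(est, X_big[rows], y_big[rows],
                               n_repeats=1, random_state=0)
    per_repeat.append(r.importances[:, 0])
importances = np.column_stack(per_repeat)  # shape (n_features, n_repeats)
print(importances.mean(axis=1), importances.std(axis=1))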

1 reaction
cmarmo commented, Sep 13, 2022

If I’m not mistaken this issue can be closed as solved by #20431. @knoam feel free to open a new one with your proposition. Thanks!
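
For reference, scikit-learn 1.0 and later expose this as the max_samples parameter of permutation_importance, which takes an absolute row count or a fraction of the data; a minimal usage sketch with the placeholders from above:

from sklearn.inspection import permutation_importance

# max_samples caps how many rows are used per repeat
# (int = absolute count, float = fraction of all rows).
result = permutation_importance(est, X_big, y_big, n_repeats=10,
                                random_state=0, max_samples=5_000)
print(result.importances_mean)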


