Add option to compute permutation_importance on a random subset of rows
For very large data sets, computing the feature importance on a random subset of rows instead of all rows can speed up the calculation dramatically. Since permutation importance runs several iterations per variable, it makes sense to draw a new subset each time to cover a larger part of the data.
The change would require sklearn.inspection.permutation_importance to have an additional argument specifying a maximum number of rows. This functionality is available in other packages, for example, in R: https://rdrr.io/cran/ingredients/man/feature_importance.html
Proposed solution
To be more precise, I'd modify `_calculate_permutation_scores` like this:
```python
import numpy as np


def _calculate_permutation_scores(self, estimator, X, y, sample_weight, col_idx,
                                  random_state, n_repeats, scorer, max_rows):
    X_permuted = X.copy()
    y_mod = y.copy()
    n_rows = X.shape[0]
    if max_rows != -1 and n_rows > max_rows:
        # Draw a row subset without replacement, reusing the passed-in
        # random_state so results stay reproducible.
        sample_rows = random_state.choice(n_rows, max_rows, replace=False)
        if hasattr(X_permuted, "iloc"):  # pandas input
            X_permuted = X_permuted.iloc[sample_rows]
            y_mod = y_mod.iloc[sample_rows]
        else:  # numpy input
            X_permuted = X_permuted[sample_rows]
            y_mod = y_mod[sample_rows]
        if sample_weight is not None:
            # Keep the weights aligned with the subsampled rows.
            sample_weight = np.asarray(sample_weight)[sample_rows]
    scores = np.zeros(n_repeats)
    shuffling_idx = np.arange(X_permuted.shape[0])
    for n_round in range(n_repeats):
        random_state.shuffle(shuffling_idx)
        if hasattr(X_permuted, "iloc"):
            col = X_permuted.iloc[shuffling_idx, col_idx]
            col.index = X_permuted.index
            X_permuted.iloc[:, col_idx] = col
        else:
            X_permuted[:, col_idx] = X_permuted[shuffling_idx, col_idx]
        feature_score = _weights_scorer(
            scorer, estimator, X_permuted, y_mod, sample_weight
        )
        scores[n_round] = feature_score
    return scores
```
I am happy to make a PR but wanted to hear the core developers' opinion first.
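To illustrate the intended behavior end to end, here is a minimal standalone sketch of the idea (not scikit-learn's internal code): each repeat permutes one column and scores it on a fresh random subset of rows, as the issue text suggests. The model, data, and subset size are arbitrary choices for illustration.

```python
# Standalone sketch of the proposal (not scikit-learn internals): permutation
# importance where each repeat is scored on a fresh random subset of rows.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

X, y = make_regression(n_samples=50_000, n_features=5, random_state=0)
model = Ridge().fit(X, y)
rng = np.random.RandomState(0)
max_rows, n_repeats = 5_000, 10

baseline = r2_score(y, model.predict(X))
for col_idx in range(X.shape[1]):
    scores = np.zeros(n_repeats)
    for n_round in range(n_repeats):
        # New subset per repeat, so the repeats together cover more of the data.
        rows = rng.choice(X.shape[0], max_rows, replace=False)
        X_sub, y_sub = X[rows].copy(), y[rows]
        X_sub[:, col_idx] = rng.permutation(X_sub[:, col_idx])
        scores[n_round] = r2_score(y_sub, model.predict(X_sub))
    print(f"feature {col_idx}: importance {baseline - scores.mean():.4f}")
```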
Top GitHub Comments
Subsampling the data before sending it to the function means calculating all statistics on this subset only, likely introducing quite a bias. Subsampling within an iteration is cleaner imho.
An alternative is calling the function with n_repeats equal to 1 several times, manually aggregating the results. But then the parallelisation by variable becomes quite inefficient.
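For reference, that workaround is possible with the public API today. A rough sketch (the toy model and the subset size of 5,000 are illustrative assumptions) could look like this:

```python
# Workaround sketch: call permutation_importance with n_repeats=1 on a fresh
# row subset each time and aggregate the single-repeat importances manually.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=50_000, n_features=5, random_state=0)
model = Ridge().fit(X, y)
rng = np.random.RandomState(0)
max_rows, n_repeats = 5_000, 10

repeats = []
for seed in range(n_repeats):
    rows = rng.choice(X.shape[0], max_rows, replace=False)
    result = permutation_importance(
        model, X[rows], y[rows], n_repeats=1, random_state=seed
    )
    repeats.append(result.importances[:, 0])  # one column per repeat

importances_mean = np.mean(repeats, axis=0)
importances_std = np.std(repeats, axis=0)
print(importances_mean)
```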
If I’m not mistaken this issue can be closed as solved by #20431. @knoam feel free to open a new one with your proposal. Thanks!
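For readers landing here later: scikit-learn 1.0+ exposes a `max_samples` parameter on `permutation_importance` that draws a random row subset (without replacement) for the permutation scoring, which appears to be the feature the referenced PR added:

```python
# Current scikit-learn API (1.0+): max_samples restricts permutation scoring
# to a random subset of rows, addressing the request in this issue.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

X, y = make_regression(n_samples=100_000, n_features=10, random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

result = permutation_importance(
    model, X, y,
    n_repeats=5,
    random_state=0,
    max_samples=10_000,  # may also be a float fraction, e.g. 0.1
)
print(result.importances_mean)
```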