feature_importances_ should be a method in the ideal design
This issue is not meant to be very practical, just a place to share my thoughts.
I believe feature_importances_ should have been designed as get_feature_importances() (which is, perhaps, funny because I think the get_feature_names design is pretty broken too), for the following reasons:
- calculating feature importances can be costly, and should not (and, in some cases, is not) be calculated at fit time unnecessarily
- there are often multiple ways to calculate feature importances (as simple as the choice of norm for coef_), and (as long as they depend on the same sufficient statistics) the user may fairly not decide which is appropriate until after fit. Thus get_feature_importances could have parameters to choose its method. Meta-estimators such as SelectFromModel and RFE currently have parameters for how they should interpret coef_ as feature importances, but really these are parameters that should be passed to the linear model’s get_feature_importances; the model itself should know how to summarise its coef_, and doing so gets more complicated once we have multi-output coef_ (see the sketch after this list)
- it is semantically different from other attributes, not being a sufficient statistic upon which basis the estimator makes predictions
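To make the second point concrete, here is a minimal sketch of what such a method could look like on a linear model. The subclass, the get_feature_importances method, and its norm parameter are all hypothetical illustrations, not part of scikit-learn's API.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression


class LogisticRegressionWithImportances(LogisticRegression):
    """Hypothetical: a linear model that summarises its own coef_."""

    def get_feature_importances(self, norm="l2"):
        # The model decides how to collapse a (possibly multi-output)
        # coef_ into a single non-negative score per feature.
        coef = np.atleast_2d(self.coef_)
        if norm == "l1":
            return np.abs(coef).sum(axis=0)
        if norm == "l2":
            return np.sqrt((coef ** 2).sum(axis=0))
        raise ValueError(f"Unknown norm: {norm!r}")


# Usage: the choice of norm is deferred until after fit.
X, y = load_iris(return_X_y=True)
model = LogisticRegressionWithImportances(max_iter=1000).fit(X, y)
print(model.get_feature_importances(norm="l1"))
```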
I don’t think there is currently sufficient motivation to change, but I could be persuaded.
Ping @kmike?
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
#12326 is another example of needing to configure a norm for coef_, where that configuration needs to be passed through a meta-estimator (as in RFE and SelectFromModel also).

More Generic SelectFromModel API Proposal
We can extend the SelectFromModel API to have a feature_importance parameter that can accept a callable (a rough sketch follows). The default value for feature_importance will be 'auto' to keep the current behavior.
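The code block from the original comment was not preserved, so the following is only a reconstruction of the idea under stated assumptions: the callable receives the fitted estimator and returns one importance per feature, and because feature_importance is not an existing SelectFromModel argument, the selection step is mocked with a small helper function.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression


def l2_importance(estimator):
    # The callable a user would pass as feature_importance=...
    return np.sqrt((np.atleast_2d(estimator.coef_) ** 2).sum(axis=0))


def select_from_model(estimator, X, y, feature_importance, threshold):
    # Roughly what SelectFromModel would do with such a callable.
    estimator.fit(X, y)
    importances = feature_importance(estimator)
    return X[:, importances >= threshold]


X, y = load_iris(return_X_y=True)
X_new = select_from_model(
    LogisticRegression(max_iter=1000), X, y, l2_importance, threshold=1.0
)
print(X_new.shape)
```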
Thoughts on permutation importance
Permutation Idea 1
Now for permutation importance, it would be extremely nice to have feature_importance='permutation' and have it magically work. The permutation importance needs the data, which means it cannot support prefit=True, and the importances must be calculated during fit. Furthermore, permutation importance accepts a scoring parameter. This means the SelectFromModel API may look like the sketch below.
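The original code block is missing here as well, so this is a minimal sketch of the underlying computation only, assuming feature_importance='permutation' means computing permutation importances during fit. The wrapper function is hypothetical, but sklearn.inspection.permutation_importance is a real helper.

```python
from sklearn.datasets import load_iris
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression


def fit_with_permutation_importances(estimator, X, y, scoring=None):
    # The data is needed here, which is why prefit=True cannot work.
    estimator.fit(X, y)
    result = permutation_importance(
        estimator, X, y, scoring=scoring, n_repeats=5, random_state=0
    )
    return estimator, result.importances_mean


X, y = load_iris(return_X_y=True)
est, importances = fit_with_permutation_importances(
    LogisticRegression(max_iter=1000), X, y, scoring="accuracy"
)
print(importances)
```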
Permutation Idea 2
We can go the other way and have users write a custom function for permutation importance:
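A minimal sketch of such a callable, assuming the three-argument signature described below; wiring it into SelectFromModel via a feature_importance parameter remains hypothetical.

```python
from sklearn.inspection import permutation_importance


def perm_importance(fitted_estimator, X, y):
    # Called with the fitted estimator and the fit data, so it cannot
    # work with prefit=True.
    result = permutation_importance(
        fitted_estimator, X, y, scoring="accuracy", n_repeats=5, random_state=0
    )
    return result.importances_mean
```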
The fitted estimator, X, and Y will get passed to perm_importance during fit. prefit=True will not be supported.