Improvements for FeatureImportances visualizer
There are a couple of enhancements for the yellowbrick.features.FeatureImportances
visualizer that should be made to really make it stand out. They are as follows:
Note to contributors: items in the below checklist don’t need to be completed in a single PR; if you see one that catches your eye, feel free to pick it off the list!
- color negative coefs
- top n features to filter number displayed (both pos and neg)
- implement standard deviation for ensemble models
Color negative coefs
The first item is relatively straightforward. Currently the bar chart is a single color, but it might be nice to show negative coef_
values in a different color, e.g. blue for positive and green for negative, as below:
To do this, you’ll have to create a color array to pass as the color argument, e.g.
colors = np.array(['b' if v > 0 else 'g' for v in self.feature_importances_])
self.ax.barh(pos, self.feature_importances_, color=colors, align='center')
We should also add arguments that let the user specify these colors.
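A minimal sketch of how this could look as a standalone function, assuming numpy and matplotlib; the `pos_color`/`neg_color` parameters are hypothetical names for the color-override arguments suggested above, not part of the visualizer's API:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_importances(feature_importances, labels, pos_color="b", neg_color="g"):
    """Draw a horizontal bar chart, coloring negative coefficients differently.

    pos_color/neg_color are hypothetical arguments sketching how the
    colors could be made configurable.
    """
    pos = np.arange(len(feature_importances))
    # Build one color per bar based on the sign of the coefficient
    colors = np.array(
        [pos_color if v > 0 else neg_color for v in feature_importances]
    )
    fig, ax = plt.subplots()
    ax.barh(pos, feature_importances, color=colors, align="center")
    ax.set_yticks(pos)
    ax.set_yticklabels(labels)
    return ax
```

Inside the visualizer itself, only the two lines building `colors` and calling `self.ax.barh` would change.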
Top N Features
For the second item, I’m picturing something similar to most informative features with scikit-learn (though not exactly this code). Here, an argument topn, which defaults to None, specifies a filter to plot only the N best features.
This should also be relatively straightforward, but it gets complicated in the case of negative values. We have two options: rank all values, including negative ones, and plot the N best by magnitude (whether positive or negative); or plot the N best positive and the N best negative coefficients.
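The two ranking options can be sketched as a small helper; the function name and the `by_magnitude` flag are hypothetical, chosen only to illustrate the choice described above:

```python
import numpy as np

def top_n_indices(importances, topn, by_magnitude=True):
    """Return the indices of the topn features to plot.

    If by_magnitude, rank all coefficients by absolute value and take the
    topn best, positive or negative (option 1). Otherwise take the topn
    most positive and the topn most negative coefficients (option 2).
    """
    importances = np.asarray(importances)
    if topn is None:
        return np.arange(len(importances))  # no filtering
    if by_magnitude:
        order = np.argsort(np.abs(importances))[::-1]  # largest magnitude first
        return order[:topn]
    order = np.argsort(importances)  # ascending: most negative first
    return np.concatenate([order[:topn], order[-topn:]])
```

Note that option 2 can return up to 2*N features, which is one of the API questions a real implementation would need to settle.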
Standard Deviation for Ensembles
Ensemble models like Random Forest and Gradient Boosting have an underlying estimators_ attribute that describes each feature’s importance in a different way. The global feature importances are the mean, but it would be nice to add an xerr bar with the standard deviation, as in plot forest importances.
This could also be useful for CV models that also have an underlying estimators_ attribute.
The idea with this one is to compute the standard deviation for each feature from estimators_ using np.std, and store the values in a confidences_ attribute during fit. Note that they will also have to be sorted using the sort_index; the confidences are then drawn during ax.barh with xerr=self.confidences_.
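A sketch of that fit-time computation, assuming scikit-learn and a fitted ensemble exposing `estimators_`; the bare variables `feature_importances_` and `confidences_` stand in for the visualizer attributes proposed above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# One row of importances per underlying estimator
importances = np.stack([est.feature_importances_ for est in model.estimators_])
feature_importances_ = importances.mean(axis=0)  # the global importances
confidences_ = importances.std(axis=0)           # per-feature standard deviation

# Both arrays must be sorted with the same index before drawing, e.g.:
sort_idx = np.argsort(feature_importances_)
# ax.barh(pos, feature_importances_[sort_idx], xerr=confidences_[sort_idx])
```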
Right now it looks like the example above is no longer working exactly as expected, so some deeper review is necessary.
See also #194 where a discussion about tree-specific feature importances is ongoing.
Issue Analytics
- Created 6 years ago
- Reactions:1
- Comments:6 (3 by maintainers)
Top GitHub Comments
Hi, I know that this is an old issue, but I can’t find an implementation of the top-n feature discussed here. Would a top-n parameter (or even a separate visualizer) be a useful feature? I use the library regularly, and it’s a feature that I recently needed for a paper.
Hello yellowbrick! In the interest of legitimate hacktoberfest PRs, I took a stab at the top_n feature for this.
Please see PR #1102 for my very quick implementation of this feature. I’m open to suggestions on how to fully flesh this out. I did not try out many combinations of parameters, so there are scenarios where top_n may or may not apply that I have not considered.