Improvements for FeatureImportances visualizer
There are a couple of enhancements for the yellowbrick.features.FeatureImportances
visualizer that should be made to really make it stand out. They are as follows:
Note to contributors: items in the below checklist don’t need to be completed in a single PR; if you see one that catches your eye, feel free to pick it off the list!
- color negative coefs
- top n features to filter number displayed (both pos and neg)
- implement standard deviation for ensemble models
Color negative coefs
The first item is relatively straightforward. Currently the bar chart is a single color, but it might be nice to show negative coef_
values in a different color, e.g. blue for positive and green for negative, as below:
To do this, you’ll have to create a color array to pass as the color argument, e.g.
colors = np.array(['b' if v > 0 else 'g' for v in self.feature_importances_])
self.ax.barh(pos, self.feature_importances_, color=colors, align='center')
We should also add arguments that let the user specify these colors.
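A minimal sketch of how this could look as a standalone function, assuming numpy and matplotlib; the `pos_color`/`neg_color` parameters are hypothetical names for the color-override arguments suggested above, not part of the visualizer's API:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_importances(feature_importances, labels, pos_color="b", neg_color="g"):
    """Draw a horizontal bar chart, coloring negative coefficients differently.

    pos_color/neg_color are hypothetical arguments sketching how the
    colors could be made configurable.
    """
    pos = np.arange(len(feature_importances))
    # Build one color per bar based on the sign of the coefficient
    colors = np.array(
        [pos_color if v > 0 else neg_color for v in feature_importances]
    )
    fig, ax = plt.subplots()
    ax.barh(pos, feature_importances, color=colors, align="center")
    ax.set_yticks(pos)
    ax.set_yticklabels(labels)
    return ax
```

Inside the visualizer itself, only the two lines building `colors` and calling `self.ax.barh` would change.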
Top N Features
For the second item, I’m picturing something similar to most informative features with scikit-learn (though not exactly this code). Here, an argument topn, which defaults to None, specifies a filter to plot only the N best features.
This should also be relatively straightforward, but it gets complicated in the case of negative values. We have two options: rank all values, including negative ones, and plot the N best by magnitude (whether positive or negative); or plot the N best positive and the N best negative coefficients.
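The two ranking options can be sketched as a small helper; the function name and the `by_magnitude` flag are hypothetical, chosen only to illustrate the choice described above:

```python
import numpy as np

def top_n_indices(importances, topn, by_magnitude=True):
    """Return the indices of the topn features to plot.

    If by_magnitude, rank all coefficients by absolute value and take the
    topn best, positive or negative (option 1). Otherwise take the topn
    most positive and the topn most negative coefficients (option 2).
    """
    importances = np.asarray(importances)
    if topn is None:
        return np.arange(len(importances))  # no filtering
    if by_magnitude:
        order = np.argsort(np.abs(importances))[::-1]  # largest magnitude first
        return order[:topn]
    order = np.argsort(importances)  # ascending: most negative first
    return np.concatenate([order[:topn], order[-topn:]])
```

Note that option 2 can return up to 2*N features, which is one of the API questions a real implementation would need to settle.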
Standard Deviation for Ensembles
Ensemble models like Random Forest and Gradient Boosting have an underlying estimators_ attribute that describes each feature’s importance in a different way. The global feature importances are the mean, but it would be nice to add an xerr bar with the standard deviation, as in plot forest importances.
This could also be useful for CV models that also have an underlying estimators_ attribute.
The idea with this one is to compute the standard deviation for each feature from estimators_ using np.std, and store the values in a confidences_ attribute during fit. Note that they will also have to be sorted using the sort_index; the confidences are then drawn during ax.barh with xerr=self.confidences_.
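A sketch of that fit-time computation, assuming scikit-learn and a fitted ensemble exposing `estimators_`; the bare variables `feature_importances_` and `confidences_` stand in for the visualizer attributes proposed above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# One row of importances per underlying estimator
importances = np.stack([est.feature_importances_ for est in model.estimators_])
feature_importances_ = importances.mean(axis=0)  # the global importances
confidences_ = importances.std(axis=0)           # per-feature standard deviation

# Both arrays must be sorted with the same index before drawing, e.g.:
sort_idx = np.argsort(feature_importances_)
# ax.barh(pos, feature_importances_[sort_idx], xerr=confidences_[sort_idx])
```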
Right now it looks like the example above is no longer working exactly as expected, so some deeper review is necessary.
See also #194 where a discussion about tree-specific feature importances is ongoing.
Issue Analytics
- Created 6 years ago
- Reactions:1
- Comments:6 (3 by maintainers)
Top GitHub Comments
Hi, I know that this is an old issue, but I can’t find an implementation of the top-n feature discussed here. Would a top-n parameter (or even a separate visualizer) be a useful feature? I use the library regularly, and it’s a feature that I recently needed for a paper.
Hello yellowbrick! In the interest of legitimate hacktoberfest PRs, I took a stab at the top_n feature for this.
Please see PR #1102 for my very quick implementation of this feature. I’m open to suggestions on how to fully flesh this out. I did not try out many combinations of parameters, so there are scenarios where top_n may or may not apply that I have not considered.