Feature Request: Include p-values attribute for logistic regression
Scikit-learn is the de facto home for all kinds of modeling algorithms. It has a plethora of algorithms, but one thing that still seems to be missing is an implementation of LogisticRegression that provides p-values.
It would be great to have something like a model.p_values_ attribute for logistic regression models.
I know that there is another statistical library, statsmodels, which provides p-values, but a lot of programmers use sklearn and build their models on it. It is somewhat inconvenient to use statsmodels just to get p-values while running other models, such as Random Forest, in sklearn.
After all, the APIs of statsmodels and sklearn are quite different. sklearn is a trendsetter and most people feel comfortable with its API, whereas statsmodels follows an R-style API.
In conclusion, it would be great if sklearn provided p-values for linear models. I am eagerly waiting for the implementation in future versions of sklearn.
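For readers who need p-values in the meantime, here is a minimal sketch of how Wald p-values could be computed by hand from fitted logistic regression parameters. It is not part of scikit-learn: the function name wald_p_values is hypothetical, and the result is only meaningful under the assumption that the parameters come from an unpenalized maximum-likelihood fit. LogisticRegression applies L2 regularization by default, so its default coefficients do not satisfy that assumption.

```python
import math

import numpy as np


def wald_p_values(X, coef, intercept):
    """Two-sided Wald p-values for a binary logistic regression fit.

    Hypothetical helper, not a scikit-learn API. Assumes `coef` and
    `intercept` are unpenalized maximum-likelihood estimates.
    Returns an array of length n_features + 1; entry 0 is the intercept.
    """
    X = np.asarray(X, dtype=float)
    # Design matrix with a leading column of ones for the intercept.
    design = np.column_stack([np.ones(X.shape[0]), X])
    beta = np.concatenate([np.ravel(intercept), np.ravel(coef)]).astype(float)

    # Predicted probabilities under the fitted model.
    prob = 1.0 / (1.0 + np.exp(-(design @ beta)))

    # Fisher information X^T W X, with W = diag(prob * (1 - prob)).
    weights = prob * (1.0 - prob)
    fisher = design.T @ (design * weights[:, None])

    # Standard errors from the diagonal of the inverse information matrix.
    std_err = np.sqrt(np.diag(np.linalg.inv(fisher)))
    z = beta / std_err

    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    return np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in z])
```

With a fitted binary sklearn model, the inputs would be the training matrix together with model.coef_ and model.intercept_, provided the model was fitted without regularization. The inversion fails or becomes unstable when features are perfectly or nearly collinear, which is one of the cases where the resulting p-values should not be trusted anyway.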
Issue Analytics
- State:
- Created 3 years ago
- Reactions: 3
- Comments: 16 (13 by maintainers)
Top GitHub Comments
Considering how this request gets repeated pretty frequently, could someone chime in with some clearer justification that could be added to the documentation? Or this feature could find a home in scikit-learn contrib and then could be linked to? I’m unclear on the reasoning for not including p-values.
Reasoning in #6773:
First, scikit-learn exposes statistical tests via the feature_selection module, and they’re very useful. It’s not like scikit-learn does no stats.
Second, I would argue that p-values are part of model interpretation, and scikit-learn has an inspection module that has model interpretation capabilities. Scikit-learn supports model interpretation for tree-based models with feature_importances_. Granted, linear models expose their coefficients via coef_. But why not add more interpretation capabilities if that’s what people want?
This comment implies that p-values should not be added because there’s uncertainty about when p-values are reliable. The same comment could be made about other parts of the library, such as trusting the feature importances of tree-based models in a multicollinear dataset. From #16860:
I think this philosophy applies to p-values. They are generally accepted as a useful statistical technique when used in the proper way. It’s up to the user to use them appropriately. But people definitely want this feature (myself included 😃).
Assumptions differ depending on whether they concern control on the coefficients or on the prediction. The control on the prediction is much more lax than the control on the coefficients.
No. No need to duplicate functionality across packages.