
Feature Request: Include p-values attribute for logistic regression

See original GitHub issue

Scikit-learn is the de facto home for all kinds of modeling algorithms. It offers a plethora of algorithms, but one thing that still seems to be missing is p-values in the LogisticRegression implementation.

It would be great to have something like a model.p_values_ attribute on Logistic Regression models.

I know there is another statistical library, statsmodels, which provides p-values, but many programmers build their models with sklearn. It is inconvenient to switch to statsmodels just to get p-values while running other models, such as Random Forest, in sklearn.

After all, the APIs of statsmodels and sklearn are quite different. sklearn is a trend-setter and most people feel comfortable with its API, whereas statsmodels follows an R-style API.

In conclusion, it would be great if sklearn provided p-values for linear models.

I am eagerly waiting for the implementation in future versions of sklearn.

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Reactions: 3
  • Comments: 16 (13 by maintainers)

Top GitHub Comments

8 reactions
skeller88 commented, Apr 26, 2020

Considering how frequently this request gets repeated, could someone chime in with a clearer justification that could be added to the documentation? Or could this feature find a home in scikit-learn-contrib and be linked from the docs? I’m unclear on the reasoning for not including p-values.

Reasoning in #6773:

I am afraid that this is out of scope for scikit-learn for two reasons:

First, the scope of scikit-learn is really predictive models, whereas confidence intervals, p-values and the like are in the scope of statsmodels.

First, scikit-learn exposes statistical tests via the feature_selection module, and they’re very useful. It’s not like scikit-learn does no stats.
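As a small illustration of that point, scikit-learn's feature_selection module already returns p-values, for example via f_classif (a univariate ANOVA F-test); the data below are synthetic, made up for the example:

```python
# f_classif returns one F statistic and one p-value per feature.
import numpy as np
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)         # class label driven only by feature 0

F, pvals = f_classif(X, y)
print(pvals)                          # tiny p-value for feature 0, large for the rest
```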

Second, I would argue that p-values are part of model interpretation, and scikit-learn has an inspection module with model-interpretation capabilities. Scikit-learn supports model interpretation for tree-based models with feature_importances_. Granted, linear models expose coef_. But why not add more interpretation capabilities if that’s what people want?

Second, as far as I know, the research on the topic of confidence intervals and p-values in high dimension is still very much open.

This comment implies that p-values should not be added because there’s uncertainty about when p-values are reliable. The same objection applies to other parts of the library, such as trusting the feature importances of tree-based models on a multicollinear dataset. From #16860:

I think we need to help people avoid big mistakes, but advice can be hard because the literature is frequently revised and needs to be considered in the context of the user’s task.

I think this philosophy applies to p-values. They are generally accepted as a useful statistical technique when used in the proper way. It’s up to the user to apply them appropriately. But people definitely want this feature (myself included 😃 ).
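For readers who land here looking for a workaround, per-coefficient Wald p-values can be computed by hand from an (effectively) unpenalized scikit-learn fit via the inverse Fisher information. The helper `logit_pvalues` below is our own sketch, not a scikit-learn API, and the data are synthetic:

```python
# Hedged sketch: Wald p-values from a fitted binary LogisticRegression.
# Assumes an effectively unpenalized fit (very large C); with regularization
# the classical standard errors below are not valid.
import numpy as np
from scipy import stats
from sklearn.linear_model import LogisticRegression

def logit_pvalues(model, X):
    """Two-sided Wald p-values for intercept and coefficients."""
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])     # prepend intercept column
    coefs = np.concatenate([model.intercept_, model.coef_.ravel()])
    p = model.predict_proba(X)[:, 1]
    W = p * (1 - p)                                   # Fisher-information weights
    cov = np.linalg.inv((X1 * W[:, None]).T @ X1)     # (X' W X)^-1
    se = np.sqrt(np.diag(cov))
    z = coefs / se
    return 2 * stats.norm.sf(np.abs(z))               # two-sided p-values

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] + rng.normal(size=500) > 0).astype(int)
model = LogisticRegression(C=1e10).fit(X, y)          # large C ≈ no penalty
print(logit_pvalues(model, X))                        # intercept, x0, x1
```

This is the same large-sample approximation statsmodels reports, so for an unpenalized fit the two should agree closely.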

1 reaction
GaelVaroquaux commented, Sep 13, 2020

Even if the CI is not calculated, a strong assumption (identifiability) is necessary in order to uniquely obtain the regression coefficient.

Assumptions differ depending on whether they are about control on the coefficients or on the prediction. The control on the prediction is much more lax than the control on the coefficients.

Should sklearn not be able to do what statsmodels (sm) can do?

No. No need to duplicate functionality across packages.

Read more comments on GitHub

Top Results From Across the Web

How to Interpret P-values and Coefficients in Regression ...
It is standard practice to use the coefficient p-values to decide whether to include variables in the final model. For the results above, ...

15.1 - Logistic Regression | STAT 501
Logistic regression models a relationship between predictor variables and a categorical response variable. For example, we could use logistic regression to ...

Feature selection for Logistic Regression - Cross Validated
First, p-values tell you nothing about the effect of the variable. I can always construct a model with a highly significant feature but ...

What is the level of significance considered in the Logistic ...
scikit-learn's LogisticRegression does not have the functionality by default; it's just not implemented: no p-values are computed or output.

Logistic Regression in Python with statsmodels
Have an understanding of Logistic Regression and associated statistical ... True LL-Null: -480.45 Covariance Type: nonrobust LLR p-value: ...
