RFC classifiers trained by minimizing the Brier loss
At the moment, our probabilistic classifiers (e.g. logistic regression and gradient-boosted trees) minimize the log loss, typically after applying a sigmoid or softmax inverse link function (usually implemented as part of the Cython loss).
However, the log loss is not the only proper scoring rule for fitting estimators of the expected conditional class probabilities. In particular, the Brier loss is also a proper scoring rule, and it has the practical advantage of being upper bounded, which should limit the impact of mislabeled examples in the training set.
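To make the boundedness point concrete, here is a small illustration (my own sketch, not from the issue) using scikit-learn's existing metric functions:

```python
import numpy as np
from sklearn.metrics import brier_score_loss, log_loss

# Two confidently wrong predictions (e.g. mislabeled training points).
y_true = np.array([0, 1])
y_prob = np.array([1 - 1e-6, 1e-6])  # predicted P(class 1) per sample

# The log loss diverges as predictions approach the wrong corner ...
print(log_loss(y_true, y_prob))          # ~13.8
# ... while the Brier loss is bounded (at most 1 per binary sample),
# which limits the leverage of any single mislabeled example.
print(brier_score_loss(y_true, y_prob))  # ~1.0
```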
Would it make sense to publicly expose such estimators in scikit-learn?
Note that for linear models, fitting the Brier loss is not equivalent to our RidgeClassifier, because the latter does not take the softmax of its raw predictions before computing the squared loss; hence it does not offer a well-defined estimate of the conditional class probabilities.
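As a quick check of this (my own illustration), RidgeClassifier regresses {-1, 1}-encoded targets with a squared loss on the raw scores, so it exposes a decision_function but no predict_proba:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier

X, y = make_classification(n_samples=200, random_state=0)
clf = RidgeClassifier().fit(X, y)

# Raw scores are not squashed through a sigmoid/softmax, so they are
# signed distances rather than probabilities and can fall outside [0, 1].
print(clf.decision_function(X[:3]))
# Consequently the estimator does not implement predict_proba at all.
print(hasattr(clf, "predict_proba"))  # False
```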
Minimizing the Brier loss of a softmax linear model is no longer a convex optimization problem, but I doubt this would prevent a Newton solver with a robust line search from converging in practice.
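To give an idea of what such a fit could look like, here is a minimal sketch using a generic quasi-Newton solver from SciPy (BFGS with line search, standing in for the Newton solver mentioned above); everything below is illustrative rather than a proposed implementation:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import softmax
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_classes=3, n_informative=6,
                           random_state=0)
n_features, n_classes = X.shape[1], 3
Y = np.eye(n_classes)[y]  # one-hot encoded targets

def brier_objective(w_flat):
    # Mean squared error between softmax probabilities and one-hot targets.
    W = w_flat.reshape(n_features, n_classes)
    P = softmax(X @ W, axis=1)
    return np.mean(np.sum((P - Y) ** 2, axis=1))

# The objective is non-convex, but a quasi-Newton method with line
# search typically still converges to a good solution in practice.
res = minimize(brier_objective, np.zeros(n_features * n_classes),
               method="BFGS")
P_hat = softmax(X @ res.x.reshape(n_features, n_classes), axis=1)
print("train accuracy:", np.mean(P_hat.argmax(axis=1) == y))
```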
Top GitHub Comments
Pretty much any GLM literature covers this; the score equation, for both general and canonical links, is the key.
For probabilistic classification in particular (outside any GLM context), the proper scoring rule literature applies as well.
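For reference, the score equation alluded to above (standard GLM theory; the notation is my own transcription):

```latex
% Score equations for a GLM with linear predictor \eta_i = x_i^\top \beta,
% mean \mu_i = g^{-1}(\eta_i) and variance function V(\mu_i):
\sum_{i} \frac{y_i - \mu_i}{V(\mu_i)}
         \frac{\partial \mu_i}{\partial \eta_i} \, x_i = 0 .
% For the canonical link, \partial \mu_i / \partial \eta_i = V(\mu_i),
% so the equations simplify to \sum_i (y_i - \mu_i) \, x_i = 0.
```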
The last point about efficiency (for fitting) is a property of maximum likelihood estimation theory (achieving the Cramér-Rao lower bound asymptotically).
That’s RidgeClassifier, isn’t it?

Oops, didn’t read carefully.

I’d say: is there good literature documenting practice (and theory) that shows the benefit of such classifiers? If not, I wouldn’t prioritize this: we already have a lot on our plate.