Feature: calibration error estimators
Implementations of standard metrics are often scattered, and this is certainly the case for L^p calibration error estimators.
Why is it important to measure model calibration? When fine-tuning a model with cross-entropy loss (or any other strictly proper loss that optimizes both accuracy and calibration), there is no guarantee that the model will turn out well-calibrated. Empirically, large neural networks have been shown to “overfit” on accuracy, leading to sub-optimal calibration.
Additionally, binned estimators typically require setting some arguments such as the binning scheme (equal-range/equal-mass/…), number of bins, and the p-norm. More advanced settings include debiasing ([1] Kumar et al. 2019) or the proxy used for average bin probability (bin center or bin left/right edge). This library might provide the right standardization and documentation on which arguments are important and how they impact estimation and comparisons.
Without complicating matters from the start, it might already be nice to have a simple calibration error estimator along the lines of ECE ([2] Guo et al. 2017), which (despite some flaws and differing implementations) is commonly used to report top-1 miscalibration. Some “reasonable” defaults are equal-range binning with 15 bins and p-norm 1, to be discussed 😃.
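For concreteness, here is a minimal sketch of such an estimator with those defaults (equal-range binning, 15 bins, p-norm 1). The function name and the NumPy-based implementation are illustrative only, not the library's API:

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15, p=1):
    """L^p calibration error with equal-range binning (illustrative sketch)."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    confidences = probs.max(axis=1)                  # top-1 confidence per sample
    accuracies = (probs.argmax(axis=1) == labels).astype(float)

    # Equal-range bin edges over [0, 1]; digitize maps each confidence to a bin index.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)

    error = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        weight = mask.mean()                         # fraction of samples in bin b
        gap = abs(accuracies[mask].mean() - confidences[mask].mean())
        error += weight * gap ** p
    return error ** (1.0 / p)

probs = np.array([[0.9, 0.1], [0.6, 0.4], [0.3, 0.7]])
labels = np.array([0, 1, 1])
print(expected_calibration_error(probs, labels))  # p=1 gives the standard ECE
```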
Regarding the implementation, there is a clean way to create a hashmap of unique bin assignments which keeps running averages of the conditional expectation and the average bin probabilities. In the future, the hashmap could even be created on the validation set and, when evaluating on the test set, values could be retrieved by hash, resulting in unbiased estimates.
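One way to read that idea, with hypothetical helper names and bookkeeping that is only a sketch of what the final design could look like: keep a dict keyed by bin index that accumulates a count and running sums of correctness and confidence, so it can be filled on one split and reused on another.

```python
from collections import defaultdict

import numpy as np

def update_bin_stats(stats, probs, labels, n_bins=15):
    """Accumulate per-bin running sums: bin_id -> [count, sum_correct, sum_confidence]."""
    probs = np.asarray(probs, dtype=float)
    confidences = probs.max(axis=1)
    accuracies = (probs.argmax(axis=1) == np.asarray(labels)).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)
    for b, acc, conf in zip(bin_ids, accuracies, confidences):
        entry = stats[int(b)]
        entry[0] += 1      # samples seen in this bin
        entry[1] += acc    # numerator of the conditional expectation (bin accuracy)
        entry[2] += conf   # numerator of the average bin probability (bin confidence)
    return stats

# Fill once (e.g. on a validation split), then reuse the same structure elsewhere.
stats = defaultdict(lambda: [0, 0.0, 0.0])
stats = update_bin_stats(stats, [[0.9, 0.1], [0.2, 0.8]], [0, 1])
```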
I would very much appreciate a discussion on this so that the community is aligned on the approach. Over the weekend, I will get my hands dirty on a first PR; maybe it is better to discuss it there 😃 First functional code dump here: https://huggingface.co/spaces/jordyvl/ece
[1] Kumar, A., Liang, P.S. and Ma, T., 2019. Verified uncertainty calibration. Advances in Neural Information Processing Systems, 32.
[2] Guo, C., Pleiss, G., Sun, Y. and Weinberger, K.Q., 2017. On calibration of modern neural networks. In International Conference on Machine Learning (pp. 1321-1330). PMLR.
Top GitHub Comments
Continued discussion here:
https://huggingface.co/spaces/jordyvl/ece/discussions/1
Hi @Jordy-VL, this looks super cool!
The `launch_module` function is mainly a helper to set up the Gradio app without much customization and serves no functional purpose when loading the metric with `evaluate`. So you can write your `app.py` with Gradio and set up your own widget however you want. I think it would be great to keep the structure similar to the other metrics so it's easily recognizable, in particular parsing the README and metric description, see here: https://github.com/huggingface/evaluate/blob/76218cf1ea5e757feaaabe96132391a5005aa84c/src/evaluate/utils/gradio.py#L119-L121
It could also be nice to add examples to the Gradio interface so you get a nice default plot.
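For reference, the minimal `app.py` used by existing metric Spaces looks roughly like the following; this assumes the `launch_gradio_widget` helper from `evaluate.utils` (the helper name mentioned above may differ in the linked revision) and uses the Space id from the code dump above:

```python
# app.py of the metric Space
import evaluate
from evaluate.utils import launch_gradio_widget

# Load the module (parsing its README/metric card) and wrap it in a default Gradio widget.
module = evaluate.load("jordyvl/ece")
launch_gradio_widget(module)
```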
Also, I don’t think it is necessary to merge your metric into `evaluate` - we could keep it where it is under your name. That way it stays connected to you and is a nice example of a community metric. What do you think?
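As a usage note for the community-metric route: the module can be loaded directly from the Hub by its Space id. The `compute` argument names and shapes below are assumptions about this particular metric's signature, not something confirmed in the thread.

```python
import evaluate

# Community metrics hosted as Spaces are loaded by "<namespace>/<space_name>".
ece = evaluate.load("jordyvl/ece")

# Assumed interface: per-class probability vectors as predictions, integer labels as references.
results = ece.compute(
    predictions=[[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]],
    references=[0, 1, 2],
)
print(results)
```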