
Feature: calibration error estimators

See original GitHub issue

Implementations of standard metrics are often scattered across codebases, and this is certainly the case for L^p calibration error estimators.

Why is it important to measure model calibration? When fine-tuning a model with cross-entropy loss (or any other strictly proper loss that rewards both accuracy and calibration), there is no guarantee that your model will turn out well-calibrated. Empirically, large NNs have been shown to “overfit” on accuracy, leading to sub-optimal calibration.

Additionally, binned estimators typically require setting some arguments such as the binning scheme (equal-range/equal-mass/…), number of bins, and the p-norm. More advanced settings include debiasing ([1] Kumar et al. 2019) or the proxy used for average bin probability (bin center or bin left/right edge). This library might provide the right standardization and documentation on which arguments are important and how they impact estimation and comparisons.
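
To make those knobs concrete, here is a rough sketch (not from the issue; plain NumPy, with illustrative names) of how the two most common binning schemes could produce bin edges:

```python
import numpy as np

def bin_edges(confidences, n_bins=15, scheme="equal-range"):
    """Return n_bins + 1 bin edges over the confidence range.

    'equal-range' splits [0, 1] into intervals of equal width;
    'equal-mass' places edges at quantiles of the observed confidences,
    so every bin holds roughly the same number of predictions.
    """
    if scheme == "equal-range":
        return np.linspace(0.0, 1.0, n_bins + 1)
    if scheme == "equal-mass":
        return np.quantile(confidences, np.linspace(0.0, 1.0, n_bins + 1))
    raise ValueError(f"Unknown binning scheme: {scheme}")
```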

Without complicating matters from the start, it might already be nice to have a simple calibration error estimator along the lines of ECE ([2] Guo et al. 2017), which (despite some flaws and differing implementations) is commonly used to report top-1 miscalibration. Some “reasonable” defaults are equal-range binning with 15 bins and p-norm 1, to be discussed 😃.
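
As a rough illustration of those defaults (equal-range binning, 15 bins, p-norm 1), a top-1 binned estimator could look like the sketch below; the function and argument names are illustrative, not the interface proposed in the issue:

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15, p=1):
    """Binned L^p calibration error on top-1 predictions.

    probs:  (n_samples, n_classes) predicted probabilities
    labels: (n_samples,) integer ground-truth classes
    """
    confidences = probs.max(axis=1)                   # top-1 confidence
    accuracies = (probs.argmax(axis=1) == labels).astype(float)

    edges = np.linspace(0.0, 1.0, n_bins + 1)         # equal-range bins
    bin_ids = np.digitize(confidences, edges[1:-1])   # indices in [0, n_bins - 1]

    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        weight = mask.mean()                          # fraction of samples in the bin
        gap = abs(accuracies[mask].mean() - confidences[mask].mean())
        ece += weight * gap ** p
    return ece ** (1.0 / p)
```

For p = 1 this reduces to the ECE reported in [2]; other p-norms only change the exponent.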

Regarding the implementation, there is a clean way to create a hashmap of unique bin assignments that keeps running averages of the conditional expectation and the average bin probability. In the future, the hashmap could even be built on the validation set and, when evaluating on the test set, values would be retrieved by their hash, resulting in unbiased estimates.
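
A minimal sketch of that idea (an assumed detail, not necessarily how the linked space implements it): one dictionary entry per occupied bin, keyed by the bin index, holding running means of accuracy and confidence:

```python
import numpy as np
from collections import defaultdict

def build_bin_stats(probs, labels, n_bins=15):
    """Map each occupied bin index to running averages of accuracy
    (the conditional expectation) and confidence (average bin probability)."""
    confidences = probs.max(axis=1)
    accuracies = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(confidences, edges[1:-1])

    stats = defaultdict(lambda: {"count": 0, "acc": 0.0, "conf": 0.0})
    for b, acc, conf in zip(bin_ids, accuracies, confidences):
        entry = stats[int(b)]
        entry["count"] += 1
        # incremental running-mean updates
        entry["acc"] += (acc - entry["acc"]) / entry["count"]
        entry["conf"] += (conf - entry["conf"]) / entry["count"]
    return dict(stats)
```

The L^p error then follows by weighting each entry's |acc − conf| gap by count / total, and the same dictionary could be filled on a validation set and queried by bin key at test time, as suggested above.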

I would very much appreciate a discussion on this so that the community is aligned on the approach. Over the weekend I will get my hands dirty on a first PR; maybe it is better to discuss it there 😃 A first functional code dump is here: https://huggingface.co/spaces/jordyvl/ece

[1] Kumar, A., Liang, P. S. and Ma, T., 2019. Verified uncertainty calibration. Advances in Neural Information Processing Systems, 32.
[2] Guo, C., Pleiss, G., Sun, Y. and Weinberger, K. Q., 2017, July. On calibration of modern neural networks. In International Conference on Machine Learning (pp. 1321-1330). PMLR.

Issue Analytics

  • State: closed
  • Created: a year ago
  • Reactions: 1
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
Jordy-VL commented, Jun 29, 2022

Continued discussion here: https://huggingface.co/spaces/jordyvl/ece/discussions/1

1 reaction
lvwerra commented, Jun 9, 2022

Hi @Jordy-VL, this looks super cool!

The launch_module function is mainly a helper to set up the Gradio app without much customization and serves no functional purpose when loading the metric with evaluate. So you can write your app.py with Gradio and set up your own widget however you want. I think it would be great to keep the structure similar to the other metrics so it's easily recognizable, in particular parsing the README and metric description; see here: https://github.com/huggingface/evaluate/blob/76218cf1ea5e757feaaabe96132391a5005aa84c/src/evaluate/utils/gradio.py#L119-L121
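
For reference, a minimal app.py for an evaluate metric space typically looks like the sketch below (assuming the launch_gradio_widget helper from evaluate.utils, which builds the widget from the parsed README and metric description linked above; the space id is taken from this thread):

```python
import evaluate
from evaluate.utils import launch_gradio_widget

# load the metric module and let the helper build the default Gradio widget;
# a custom app.py could instead define its own gr.Interface around module.compute
module = evaluate.load("jordyvl/ece")
launch_gradio_widget(module)
```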

It could also be nice to add examples to the Gradio interface so you get a nice default plot.

Also, I don’t think it is necessary to merge your metric into evaluate - we could keep it where it is under your name. That way it stays connected to you and is a nice example of a community metric. What do you think?
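
For illustration, loading the community metric directly from the Space would presumably look like this (the compute arguments are assumed here and depend on the metric's own feature definition):

```python
import evaluate

ece_metric = evaluate.load("jordyvl/ece")  # "<namespace>/<space-name>" for community metrics
results = ece_metric.compute(
    predictions=[[0.6, 0.2, 0.2], [0.1, 0.8, 0.1]],  # per-class probabilities (assumed format)
    references=[0, 1],
)
print(results)
```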

Read more comments on GitHub >

Top Results From Across the Web

[2012.08668] Mitigating Bias in Calibration Error Estimation
We propose a framework through which we can compute the bias of a particular estimator for an evaluation data set of a given...

Estimating Expected Calibration Errors | SpringerLink
Calibration characterizes how much a model is able to output scores corresponding to actual posterior probabilities. The first and simplest ...

Mitigating Bias in Calibration Error Estimation
To obtain an estimate of the calibration error, or ECE, the standard procedure partitions the model confidence scores into bins and compares the...

Estimating Expected Calibration Errors - YouTube
Nicolas Posocco presents his work on the empirical evaluation of calibration metrics in the context of classification.

MITIGATING BIAS IN CALIBRATION ERROR ... - OpenReview
We propose a simple alternative calibration error metric, ECE_SWEEP, in which the number of bins is chosen to be as large as possible...
