Feature: calibration error estimators
Implementations of standard metrics are often scattered, and this is certainly the case for L^p calibration error estimators.
Why is it important to measure model calibration? When fine-tuning a model with cross-entropy loss (or any other strictly proper loss that optimizes both accuracy and calibration), there is no guarantee that the model will turn out well-calibrated. Empirically, large neural networks have been shown to “overfit” on accuracy, leading to sub-optimal calibration.
Additionally, binned estimators typically require setting some arguments such as the binning scheme (equal-range/equal-mass/…), number of bins, and the p-norm. More advanced settings include debiasing ([1] Kumar et al. 2019) or the proxy used for average bin probability (bin center or bin left/right edge). This library might provide the right standardization and documentation on which arguments are important and how they impact estimation and comparisons.
Without complicating matters from the start, it might already be nice to have a simple calibration error estimator along the lines of ECE ([2] Guo et al. 2017), which (despite some flaws and differing implementations) is commonly used to report top-1 miscalibration. Some “reasonable” defaults are equal-range binning with 15 bins and p-norm 1, to be discussed 😃.
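For concreteness, here is a minimal sketch of such an estimator with those defaults (equal-range binning, 15 bins, p-norm 1). The function name and the NumPy-based implementation are illustrative only, not the library's API:

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15, p=1):
    """L^p calibration error with equal-range binning (illustrative sketch)."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    confidences = probs.max(axis=1)                  # top-1 confidence per sample
    accuracies = (probs.argmax(axis=1) == labels).astype(float)

    # Equal-range bin edges over [0, 1]; digitize maps each confidence to a bin index.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)

    error = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        weight = mask.mean()                         # fraction of samples in bin b
        gap = abs(accuracies[mask].mean() - confidences[mask].mean())
        error += weight * gap ** p
    return error ** (1.0 / p)

probs = np.array([[0.9, 0.1], [0.6, 0.4], [0.3, 0.7]])
labels = np.array([0, 1, 1])
print(expected_calibration_error(probs, labels))  # p=1 gives the standard ECE
```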
Regarding the implementation, there is a clean way to create a hashmap of unique bin assignments which keeps running averages of the conditional expectation and the average bin probabilities. In the future, the hashmap could even be created on the validation set and, when evaluating on the test set, values could be retrieved by hash, resulting in unbiased estimates.
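One way to read that idea, with hypothetical helper names and bookkeeping that is only a sketch of what the final design could look like: keep a dict keyed by bin index that accumulates a count and running sums of correctness and confidence, so it can be filled on one split and reused on another.

```python
from collections import defaultdict

import numpy as np

def update_bin_stats(stats, probs, labels, n_bins=15):
    """Accumulate per-bin running sums: bin_id -> [count, sum_correct, sum_confidence]."""
    probs = np.asarray(probs, dtype=float)
    confidences = probs.max(axis=1)
    accuracies = (probs.argmax(axis=1) == np.asarray(labels)).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)
    for b, acc, conf in zip(bin_ids, accuracies, confidences):
        entry = stats[int(b)]
        entry[0] += 1      # samples seen in this bin
        entry[1] += acc    # numerator of the conditional expectation (bin accuracy)
        entry[2] += conf   # numerator of the average bin probability (bin confidence)
    return stats

# Fill once (e.g. on a validation split), then reuse the same structure elsewhere.
stats = defaultdict(lambda: [0, 0.0, 0.0])
stats = update_bin_stats(stats, [[0.9, 0.1], [0.2, 0.8]], [0, 1])
```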
I would very much appreciate a discussion on this so that the community is aligned on the approach. Over the weekend, I will get my hands dirty on a first PR; maybe it is better to discuss it there 😃 First functional code dump here: https://huggingface.co/spaces/jordyvl/ece
[1] Kumar, A., Liang, P.S. and Ma, T., 2019. Verified uncertainty calibration. Advances in Neural Information Processing Systems, 32.
[2] Guo, C., Pleiss, G., Sun, Y. and Weinberger, K.Q., 2017. On calibration of modern neural networks. In International Conference on Machine Learning (pp. 1321-1330). PMLR.
Top GitHub Comments
Continued discussion here:
https://huggingface.co/spaces/jordyvl/ece/discussions/1
Hi @Jordy-VL, this looks super cool!
The `launch_module` function is mainly a helper to set up the Gradio app without much customization and serves no functional purpose when loading the metric with `evaluate`. So you can write your `app.py` with Gradio and set up your own widget however you want. I think it would be great to keep the structure similar to the other metrics so it's easily recognizable, in particular parsing the README and metric description, see here: https://github.com/huggingface/evaluate/blob/76218cf1ea5e757feaaabe96132391a5005aa84c/src/evaluate/utils/gradio.py#L119-L121
It could also be nice to add examples to the Gradio interface so you get a nice default plot.
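For reference, the minimal `app.py` used by existing metric Spaces looks roughly like the following; this assumes the `launch_gradio_widget` helper from `evaluate.utils` (the helper name mentioned above may differ in the linked revision) and uses the Space id from the code dump above:

```python
# app.py of the metric Space
import evaluate
from evaluate.utils import launch_gradio_widget

# Load the module (parsing its README/metric card) and wrap it in a default Gradio widget.
module = evaluate.load("jordyvl/ece")
launch_gradio_widget(module)
```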
Also, I don’t think it is necessary to merge your metric into `evaluate` - we could keep it where it is under your name. That way it stays connected to you and is a nice example of a community metric. What do you think?
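As a usage note for the community-metric route: the module can be loaded directly from the Hub by its Space id. The `compute` argument names and shapes below are assumptions about this particular metric's signature, not something confirmed in the thread.

```python
import evaluate

# Community metrics hosted as Spaces are loaded by "<namespace>/<space_name>".
ece = evaluate.load("jordyvl/ece")

# Assumed interface: per-class probability vectors as predictions, integer labels as references.
results = ece.compute(
    predictions=[[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]],
    references=[0, 1, 2],
)
print(results)
```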