Feature: compose multiple metrics into single object
See original GitHub issue

Models are often evaluated on multiple metrics in a project. E.g. a classification project might always want to report Accuracy, Precision, Recall, and F1 score. In scikit-learn one would use the widely used classification report for that. This proposal takes that a step further and allows the user to freely compose metrics: similar to a `DatasetDict`, one could use a `MetricsSuite` like a `Metric` object.
```python
metrics_suite = MetricsSuite(
    {
        "accuracy": load_metric("accuracy"),
        "recall": load_metric("recall"),
    }
)
```
Or, for a generation task:

```python
metrics_suite = MetricsSuite(
    {
        "bleu": load_metric("bleu"),
        "rouge": load_metric("rouge"),
        "perplexity": load_metric("perplexity"),
    }
)

metrics_suite.add(predictions, references)
metrics_suite.compute()
>>> {"bleu": bleu_result_dict, "rouge": rouge_result_dict, "perplexity": perplexity_result_dict}
```
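A minimal, self-contained sketch of what such a `MetricsSuite` could look like. The class and the toy metric callables below are hypothetical stand-ins (real metrics would come from `load_metric`); the sketch only illustrates the proposed `add`/`compute` flow where every metric sees the same buffered inputs:

```python
# Hypothetical sketch of the proposed MetricsSuite API. Plain callables
# stand in for load_metric(...) objects so the example is runnable.

class MetricsSuite:
    def __init__(self, metrics):
        self.metrics = metrics          # dict: name -> metric callable
        self.predictions = []
        self.references = []

    def add(self, predictions, references):
        # Buffer a batch; every metric will see the same inputs.
        self.predictions.extend(predictions)
        self.references.extend(references)

    def compute(self):
        # One result dict per metric, keyed by metric name.
        return {
            name: metric(self.predictions, self.references)
            for name, metric in self.metrics.items()
        }

def accuracy(preds, refs):
    correct = sum(p == r for p, r in zip(preds, refs))
    return {"accuracy": correct / len(refs)}

def recall(preds, refs):
    # Recall for the positive class (label 1).
    tp = sum(p == r == 1 for p, r in zip(preds, refs))
    positives = sum(r == 1 for r in refs)
    return {"recall": tp / positives}

suite = MetricsSuite({"accuracy": accuracy, "recall": recall})
suite.add(predictions=[1, 0, 1, 1], references=[1, 1, 1, 0])
print(suite.compute())
# {'accuracy': {'accuracy': 0.5}, 'recall': {'recall': 0.6666666666666666}}
```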
Alternatively, we could also flatten the returned dict, or offer that as an option. We could also add a `summary` option that defines how an overall result is calculated. E.g. `summary="average"` averages all the metrics into a single summary score, or a custom function such as `summary=lambda x: x["bleu"]**2 + 0.5*x["rouge"] + 2` gives full control. This would allow creating simple, composed metrics without needing to define a new metric (e.g. for a custom benchmark).
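The `summary` option could be sketched as follows. The function name and the flat score dict are assumptions for illustration; the two modes are the ones proposed above (a built-in `"average"` rule, or a user-supplied callable):

```python
# Hypothetical sketch of the proposed `summary` option: aggregate all
# metric scores into one number, by a built-in rule or a user function.

def compute_with_summary(results, summary):
    # `results` maps metric name -> flat score, e.g. {"bleu": 0.4, ...}
    if summary == "average":
        results["summary"] = sum(results.values()) / len(results)
    elif callable(summary):
        results["summary"] = summary(results)
    return results

scores = {"bleu": 0.4, "rouge": 0.6}

# Built-in rule: plain average of all scores.
averaged = compute_with_summary(dict(scores), "average")

# Custom rule: the lambda from the proposal above.
custom = compute_with_summary(
    dict(scores), lambda x: x["bleu"] ** 2 + 0.5 * x["rouge"] + 2
)
```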
Issue Analytics
- Created a year ago
- Reactions: 1
- Comments: 5 (5 by maintainers)
Top GitHub Comments
From our discussion, here are some API ideas:
I. For metrics with the same inputs
PS: it fails if you mix metrics with incompatible inputs, like “accuracy” and “bleu”. If we want to support this in the end, we would need a function that lets users redefine their inputs.
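The compatibility check could work roughly as follows; `combine` and the way metric input signatures are represented here are assumptions, not the library's actual API:

```python
# Hypothetical sketch of the check from idea I: composing metrics is only
# allowed when they all expect the same input names.

def combine(metrics):
    # `metrics` maps name -> (input_names, fn). All metrics must agree
    # on the input names they expect.
    signatures = {tuple(sorted(inputs)) for inputs, _ in metrics.values()}
    if len(signatures) > 1:
        raise ValueError(f"incompatible metric inputs: {signatures}")
    return metrics

# Toy signatures: accuracy and recall agree; bleu expects an extra input.
accuracy = ({"predictions", "references"}, lambda **kw: ...)
recall = ({"predictions", "references"}, lambda **kw: ...)
bleu = ({"predictions", "references", "max_order"}, lambda **kw: ...)

combine({"accuracy": accuracy, "recall": recall})  # ok
# combine({"accuracy": accuracy, "bleu": bleu})    # raises ValueError
```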
II. If input names don’t match, it’s ok
PS 2: would it be nice to rename the input of perplexity to be one of references or predictions ?
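Renaming a metric's inputs (so e.g. perplexity's `input_texts` can be fed as `predictions`) could be a thin wrapper. `rename_inputs` and the stand-in perplexity function are hypothetical; only the renaming idea comes from the discussion above:

```python
# Hypothetical sketch for idea II: rename the keyword arguments a metric
# expects, e.g. map "predictions" onto perplexity's "input_texts".

def rename_inputs(metric_fn, mapping):
    # mapping: caller-facing name -> name the metric actually expects
    def wrapped(**kwargs):
        return metric_fn(**{mapping.get(k, k): v for k, v in kwargs.items()})
    return wrapped

def perplexity(input_texts):
    # Stand-in metric: pretend perplexity is the mean text length.
    return {"mean_perplexity": sum(len(t) for t in input_texts) / len(input_texts)}

ppl = rename_inputs(perplexity, {"predictions": "input_texts"})
print(ppl(predictions=["ab", "abcd"]))  # {'mean_perplexity': 3.0}
```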
III. If you need more control over the metric, you can load them separately
IV. Aggregate metrics
PS 3: we could also have a function `.apply()` if users want to define their own aggregation functions.

PS 4: we could also allow users to discard and/or rename output values.
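PS 3 and PS 4 combined could look roughly like this; `postprocess` and its parameter names are assumptions made for the sketch:

```python
# Hypothetical sketch of PS 3/PS 4: rename or discard output values,
# then apply a user-defined aggregation function.

def postprocess(results, rename=None, discard=(), apply_fn=None):
    # `results` is a flat dict of scores from the individual metrics.
    out = {k: v for k, v in results.items() if k not in discard}
    for old, new in (rename or {}).items():
        out[new] = out.pop(old)
    if apply_fn is not None:
        out = apply_fn(out)
    return out

scores = {"bleu": 0.4, "rouge1": 0.6, "rougeL": 0.5}
final = postprocess(
    scores,
    rename={"rouge1": "rouge"},
    discard={"rougeL"},
    apply_fn=lambda r: {**r, "mean": sum(r.values()) / len(r)},
)
print(final)  # {'bleu': 0.4, 'rouge': 0.6, 'mean': 0.5}
```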
Feel free to comment/edit this if you have other ideas 😃
I haven’t seen any other library or project approach mixing metrics with an API like this, and I feel it would be hard to build an intuition that it must be used this way. Therefore I would be in favor of having something more explicit.

Anyway, it’s ok not to focus on this case right now; I don’t think it would see much usage, and users can still handle the metrics separately.