
Feature: compose multiple metrics into single object


Models are often evaluated on multiple metrics in a project. E.g. a classification project might always want to report accuracy, precision, recall, and F1 score. In scikit-learn, the widely used classification report serves that purpose. This proposal takes it a step further and lets the user freely compose metrics: similar to a DatasetDict, one could use a MetricsSuite like a Metric object.

metrics_suite = MetricsSuite(
    {
        "accuracy": load_metric("accuracy"),
        "recall": load_metric("recall")
    }
)

metrics_suite = MetricsSuite(
    {
        "bleu": load_metric("bleu"),
        "rouge": load_metric("rouge"),
        "perplexity": load_metric("perplexity")
    }
)

metrics_suite.add(predictions, references)
metrics_suite.compute()
>>> {"bleu": bleu_result_dict, "rouge": roughe_result_dict, "perplexity": perplexity_result_dict}

Alternatively, we could also flatten the return dict, or offer that as an option. We could also add a summary option that defines how an overall result is calculated, e.g. summary="average" averages all the metrics into a single summary score, or a custom function such as summary=lambda x: x["bleu"]**2 + 0.5*x["rouge"] + 2. This would make it possible to create simple, composed metrics without needing to define a new metric (e.g. for a custom benchmark).
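For concreteness, here is a minimal sketch of what such a wrapper could look like. Nothing below is an existing datasets API: MetricsSuite and the summary argument are hypothetical, and the sketch only assumes each wrapped metric exposes the usual add_batch()/compute() interface of a datasets.Metric object.

class MetricsSuite:
    """Hypothetical wrapper that forwards inputs to several metrics at once."""

    def __init__(self, metrics, summary=None):
        self.metrics = metrics    # dict: name -> metric object
        self.summary = summary    # None, "average", or a callable over the result dict

    def add_batch(self, *, predictions, references):
        # Forward the same inputs to every wrapped metric.
        for metric in self.metrics.values():
            metric.add_batch(predictions=predictions, references=references)

    def compute(self):
        results = {name: metric.compute() for name, metric in self.metrics.items()}
        if self.summary == "average":
            # Average the first scalar of each sub-result (illustrative only).
            scores = [next(iter(r.values())) for r in results.values()]
            results["summary"] = sum(scores) / len(scores)
        elif callable(self.summary):
            results["summary"] = self.summary(results)
        return results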

cc @douwekiela @lewtun

Issue Analytics

  • State: closed
  • Created: a year ago
  • Reactions: 1
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

4 reactions
lhoestq commented, Apr 8, 2022

From our discussion, here are some API ideas:

I. For metrics with the same inputs

>>> from evaluate import load_metrics
>>> 
>>> metric = load_metrics(["bleu", "rouge"])
>>> metric.compute(predictions=..., references=...)
{"rouge1": ..., "rouge2": ... , ... , "bleu": ...}

PS: it fails if you mix metrics with incompatible inputs like “accuracy” and “bleu”. We would need a function for users to redefine their inputs if we want to have this in the end:

>>> metric.compute(text_predictions=..., text_references=..., labels_predictions=..., labels_references=...)
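A small inspection helper makes the compatibility problem concrete: each metric declares the input features it expects, so you can check up front whether two metrics can share the same inputs. load_metric and MetricInfo.features are existing datasets APIs; the loop itself is just an illustration, not part of the proposal.

from datasets import load_metric

# Print the declared input features of a few metrics (illustration only).
for name in ["accuracy", "bleu", "rouge"]:
    metric = load_metric(name)
    print(name, metric.info.features)
# "accuracy" expects integer predictions/references while "bleu" expects
# tokenized text, so the two cannot be fed the same inputs.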

II. If input names don’t match, it’s ok

>>> from evaluate import load_metrics
>>>
>>> metric = load_metrics(["bleu", "perplexity"])
>>> metric.compute(predictions=..., references=..., input_texts=...)  # input_texts is the perplexity input
{"bleu": ..., ..., "perplexity": ...}

PS 2: would it be nice to rename the input of perplexity to be one of references or predictions?
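One way to make case II work without renaming anything is to route each keyword argument only to the metrics that declare it. The sketch below is hypothetical, not an existing API: it uses datasets.load_metric and MetricInfo.features, and assumes each metric's declared feature names match its compute() arguments (per the thread, perplexity would declare input_texts).

from datasets import load_metric

def compute_all(metric_names, **inputs):
    # Hypothetical routing: each metric only receives the inputs it declares.
    results = {}
    for name in metric_names:
        metric = load_metric(name)
        expected = set(metric.info.features)   # e.g. {"predictions", "references"}
        kwargs = {key: value for key, value in inputs.items() if key in expected}
        results[name] = metric.compute(**kwargs)
    return results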

III. If you need more control over the metric, you can load them separately

>>> from evaluate import load_metric, combine_metrics
>>> 
>>> bleu = load_metric("bleu")
>>> bleurt = load_metric("bleurt", "bleurt-large-512")
>>> metric = combine_metrics([bleu, bleurt])

IV. Aggregate metrics

>>> from evaluate import load_metrics
>>> 
>>> metric = load_metrics(["bleu", "rouge"]).add_mean(["rouge1", "rouge2", "bleu"], weights=[...])
>>> metric.compute(predictions=..., references=...)
{"rouge1": ..., "rouge2": ... , ... ,"mean_rouge1_rouge2_bleu": ...}

PS 3: we could also have a function .apply() if users want to define their own aggregation functions.

PS 4: we could also allow users to discard and/or rename output values
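The aggregation in IV (and the .apply() idea in PS 3) can also be prototyped today as a plain post-processing step over the combined result dict. The helper and the numbers below are purely illustrative.

def add_mean(results, keys, weights=None):
    # Add a (weighted) mean of the selected scores to the result dict.
    weights = weights or [1.0] * len(keys)
    total = sum(weight * results[key] for key, weight in zip(keys, weights))
    results["mean_" + "_".join(keys)] = total / sum(weights)
    return results

add_mean({"rouge1": 0.41, "rouge2": 0.19, "bleu": 0.27}, ["rouge1", "rouge2", "bleu"])
# {"rouge1": 0.41, "rouge2": 0.19, "bleu": 0.27, "mean_rouge1_rouge2_bleu": 0.29}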

Feel free to comment/edit this if you have other ideas 😃

1 reaction
lhoestq commented, Apr 14, 2022

I haven’t seen any other library or project approach a mix of metrics with an API like this, and I feel it would be hard to build an intuition that it must be used this way. Therefore I would be in favor of something more explicit.

Anyway, it’s ok not to focus on this case right now; I don’t think it would see much usage, and users can still handle the two metrics separately.
