
Feature: compose multiple metrics into single object


Models are often evaluated on multiple metrics in a project. E.g. a classification project might always want to report accuracy, precision, recall, and F1 score. In scikit-learn, the widely used classification report serves that purpose. This proposal takes it a step further and lets the user freely compose metrics: similar to a DatasetDict, one could use a MetricsSuite like a Metric object.

metrics_suite = MetricsSuite(
    {
        "accuracy": load_metric("accuracy"),
        "recall": load_metric("recall")
    }
)

metrics_suite = MetricsSuite(
    {
        "bleu": load_metric("bleu"),
        "rouge": load_metric("rouge"),
        "perplexity": load_metric("perplexity")
    }
)

metrics_suite.add(predictions, references)
metrics_suite.compute()
>>> {"bleu": bleu_result_dict, "rouge": roughe_result_dict, "perplexity": perplexity_result_dict}

Alternatively, we could also flatten the return dict, or offer that as an option. We could also add a summary option that defines how an overall result is calculated, e.g. summary="average" averages all the metrics into a single summary score, or a custom function such as summary=lambda x: x["bleu"]**2 + 0.5*x["rouge"] + 2. This would make it possible to create simple, composed metrics without needing to define a new metric (e.g. for a custom benchmark).
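For concreteness, here is a minimal sketch of what such a wrapper could look like. Nothing below is an existing datasets API: MetricsSuite and the summary argument are hypothetical, and the sketch only assumes each wrapped metric exposes the usual add_batch()/compute() interface of a datasets.Metric object.

class MetricsSuite:
    """Hypothetical wrapper that forwards inputs to several metrics at once."""

    def __init__(self, metrics, summary=None):
        self.metrics = metrics    # dict: name -> metric object
        self.summary = summary    # None, "average", or a callable over the result dict

    def add_batch(self, *, predictions, references):
        # Forward the same inputs to every wrapped metric.
        for metric in self.metrics.values():
            metric.add_batch(predictions=predictions, references=references)

    def compute(self):
        results = {name: metric.compute() for name, metric in self.metrics.items()}
        if self.summary == "average":
            # Average the first scalar of each sub-result (illustrative only).
            scores = [next(iter(r.values())) for r in results.values()]
            results["summary"] = sum(scores) / len(scores)
        elif callable(self.summary):
            results["summary"] = self.summary(results)
        return results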

cc @douwekiela @lewtun

Issue Analytics

  • State: closed
  • Created: a year ago
  • Reactions: 1
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

4 reactions
lhoestq commented, Apr 8, 2022

From our discussion, here are some API ideas:

I. For metrics with the same inputs

>>> from evaluate import load_metrics
>>> 
>>> metric = load_metrics(["bleu", "rouge"])
>>> metric.compute(predictions=..., references=...)
{"rouge1": ..., "rouge2": ... , ... , "bleu": ...}

PS: it fails if you mix metrics with incompatible inputs like “accuracy” and “bleu”. We would need a function for users to redefine their inputs if we want to have this in the end:

>>> metric.compute(text_predictions=..., text_references=..., labels_predictions=..., labels_references=...)
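A small inspection helper makes the compatibility problem concrete: each metric declares the input features it expects, so you can check up front whether two metrics can share the same inputs. load_metric and MetricInfo.features are existing datasets APIs; the loop itself is just an illustration, not part of the proposal.

from datasets import load_metric

# Print the declared input features of a few metrics (illustration only).
for name in ["accuracy", "bleu", "rouge"]:
    metric = load_metric(name)
    print(name, metric.info.features)
# "accuracy" expects integer predictions/references while "bleu" expects
# tokenized text, so the two cannot be fed the same inputs.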

II. If input names don’t match, it’s ok

>>> from evaluate import load_metrics
>>>
>>> metric = load_metrics(["bleu", "perplexity"])
>>> metric.compute(predictions=..., references=..., input_texts=...)  # input_texts is the perplexity input
{"bleu": ..., ..., "perplexity": ...}

PS 2: would it be nice to rename the input of perplexity to be one of references or predictions?
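One way to make case II work without renaming anything is to route each keyword argument only to the metrics that declare it. The sketch below is hypothetical, not an existing API: it uses datasets.load_metric and MetricInfo.features, and assumes each metric's declared feature names match its compute() arguments (per the thread, perplexity would declare input_texts).

from datasets import load_metric

def compute_all(metric_names, **inputs):
    # Hypothetical routing: each metric only receives the inputs it declares.
    results = {}
    for name in metric_names:
        metric = load_metric(name)
        expected = set(metric.info.features)   # e.g. {"predictions", "references"}
        kwargs = {key: value for key, value in inputs.items() if key in expected}
        results[name] = metric.compute(**kwargs)
    return results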

III. If you need more control over the metric, you can load them separately

>>> from evaluate import load_metric, combine_metrics
>>> 
>>> bleu = load_metric("bleu")
>>> bleurt = load_metric("bleurt", "bleurt-large-512")
>>> metric = combine_metrics([bleu, bleurt])

IV. Aggregate metrics

>>> from evaluate import load_metrics
>>> 
>>> metric = load_metrics(["bleu", "rouge"]).add_mean(["rouge1", "rouge2", "bleu"], weights=[...])
>>> metric.compute(predictions=..., references=...)
{"rouge1": ..., "rouge2": ... , ... ,"mean_rouge1_rouge2_bleu": ...}

PS 3: we could also have a function .apply() if users want to define their own aggregation functions.

PS 4: we could also allow users to discard and/or rename output values
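The aggregation in IV (and the .apply() idea in PS 3) can also be prototyped today as a plain post-processing step over the combined result dict. The helper and the numbers below are purely illustrative.

def add_mean(results, keys, weights=None):
    # Add a (weighted) mean of the selected scores to the result dict.
    weights = weights or [1.0] * len(keys)
    total = sum(weight * results[key] for key, weight in zip(keys, weights))
    results["mean_" + "_".join(keys)] = total / sum(weights)
    return results

add_mean({"rouge1": 0.41, "rouge2": 0.19, "bleu": 0.27}, ["rouge1", "rouge2", "bleu"])
# {"rouge1": 0.41, "rouge2": 0.19, "bleu": 0.27, "mean_rouge1_rouge2_bleu": 0.29}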

Feel free to comment/edit this if you have other ideas 😃

1 reaction
lhoestq commented, Apr 14, 2022

I haven’t seen any other library or project approach a mix of metrics with an API like this, and I feel it would be hard to build an intuition that it must be used this way. Therefore I would be in favor of something more explicit.

Anyway, it’s ok not to focus on this case right now; I don’t think it would see much usage, and users can still handle the two metrics separately.
