
Defining and standardizing metric input structures


This issue documents the input types and data structures expected by each metric.
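
For context, each spec below describes the per-example feature type of the lists passed to compute(). A minimal sketch of that mapping, assuming the evaluate loading API (older code uses datasets.load_metric with the same compute() call):

# Minimal sketch: how the feature specs below map onto compute() calls.
import evaluate

accuracy = evaluate.load("accuracy")

# predictions = Value("int32"), references = Value("int32")
# -> flat lists with one class id per example
result = accuracy.compute(predictions=[0, 1, 1, 0], references=[0, 1, 0, 0])
print(result)  # {'accuracy': 0.75}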

accuracy

predictions = Value("int32")
references = Value("int32")

or if "multilabel" mode:

predictions = Sequence(Value("int32"))
references = Sequence(Value("int32"))

bertscore

predictions = Value("string", id="sequence")
references = Sequence(Value("string", id="sequence"), id="references")

bleu

predictions = Sequence(Value("string", id="token"), id="sequence")
references = Sequence(Sequence(Value("string", id="token"), id="sequence"), id="references")
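
To make the nesting concrete, a hedged sketch of a BLEU call matching the tokenized spec above (one token sequence per prediction, a list of token sequences per reference entry); note that newer releases of the metric accept raw strings and tokenize internally:

# Sketch only: follows the tokenized input layout documented in this issue.
import evaluate

bleu = evaluate.load("bleu")
predictions = [["the", "cat", "sat", "on", "the", "mat"]]
references = [[["the", "cat", "is", "on", "the", "mat"],
               ["there", "is", "a", "cat", "on", "the", "mat"]]]
print(bleu.compute(predictions=predictions, references=references))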

bleurt

predictions = Value("string", id="sequence")
references = Value("string", id="sequence")

cer

predictions = Value("string", id="sequence")
references = Value("string", id="sequence")

chrf

predictions = Value("string", id="sequence")
references = Sequence(Value("string", id="sequence"), id="references")

code_eval

predictions = Sequence(Value("string"))
references = Value("string")

comet

sources = Value("string", id="sequence")
predictions = Value("string", id="sequence")
references = Value("string", id="sequence")

competition_math

predictions = Value("string")
references = Value("string")

coval

predictions = Sequence(Value("string"))
references = Sequence(Value("string"))

N.B. The sentences have to be in CoNLL format, which may be tricky to handle in some cases

cuad

"predictions": {
    "id": Value("string"),
    "prediction_text": Sequence(Value("string")),
}
  "references": {
      "id": Value("string"),
      "answers": Sequence(
          {
              "text": Value("string"),
              "answer_start": Value("int32"),
          }
      ),
  },
}

exact_match

predictions = Value("string", id="sequence")
references = Value("string", id="sequence")

f1

predictions = Sequence(Value("int32"))
references = Sequence(Value("int32"))

frugalscore

references = Value("string")
predictions = Value("string")

gleu

predictions = Sequence(Value("string", id="token"), id="sequence")
references = Sequence(Sequence(Value("string", id="token"), id="sequence"), id="references")

glue

predictions = Value("int64" if self.config_name != "stsb" else "float32")
references = Value("int64" if self.config_name != "stsb" else "float32")

The type of input depends on the GLUE subset used.
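
For example (a small sketch using the evaluate loader): classification subsets take integer labels, while the regression subset "stsb" takes floats:

import evaluate

mrpc = evaluate.load("glue", "mrpc")
print(mrpc.compute(predictions=[0, 1, 1], references=[0, 1, 0]))    # accuracy/F1 on int labels

stsb = evaluate.load("glue", "stsb")
print(stsb.compute(predictions=[1.2, 3.4], references=[1.0, 3.5]))  # Pearson/Spearman on float scores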

google_bleu

predictions = Sequence(Value("string", id="token"), id="sequence")
references = Sequence(Sequence(Value("string", id="token"), id="sequence"), id="references")

indic_glue

predictions = Value("int64") if self.config_name != "cvit-mkb-clsr" else Sequence(Value("float32"))
references = Value("int64") if self.config_name != "cvit-mkb-clsr" else Sequence(Value("float32"))

mae

predictions = Value("float")
references = Value("float")

or if multilist:

predictions = Sequence(Value("float"))
references = Sequence(Value("float"))

mahalanobis

"X": Sequence(Value("float", id="sequence"), id="X")
reference_distribution is passed as a compute() kwarg and converted with np.array(reference_distribution)

N.B. the names for references and predictions are different here – maybe we should standardize? wdyt @lhoestq
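
For illustration, a sketch of the resulting call, which takes X plus a reference_distribution keyword instead of predictions/references (argument names as documented above):

import evaluate

mahalanobis = evaluate.load("mahalanobis")
results = mahalanobis.compute(
    X=[[0.0, 1.0], [1.0, 0.0]],
    reference_distribution=[[0.5, 0.5], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
)
print(results)  # {'mahalanobis': array([...])}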

matthews_correlation

predictions = Value("int32")
references = Value("int32")

mauve

predictions = Value("string", id="sequence")
references = Value("string", id="sequence")

mean_iou

predictions = Sequence(Sequence(Value("uint16")))
references = Sequence(Sequence(Value("uint16")))

What’s a uint16? (An unsigned 16-bit integer, not Unicode.) This is the only metric with that type restriction (so far).

meteor

predictions = Value("string", id="sequence")
references = Value("string", id="sequence")

mse

predictions = Value("float")
references = Value("float")

or if multilist:

predictions = Sequence(Value("float"))
references = Sequence(Value("float"))

pearsonr

references = Value("float")
predictions = Value("float")

perplexity

input_texts = Value("string")

precision

predictions = Value("int32")
references = Value("int32")

or if multilist:

predictions = Sequence(Value("int32"))
references = Sequence(Value("int32"))

recall

predictions = Value("int32")
references = Value("int32")

or if multilist:

predictions = Sequence(Value("int32"))
references = Sequence(Value("int32"))

rouge

predictions = Value("string", id="sequence")
references = Value("string", id="sequence")

sacrebleu

predictions = Value("string", id="sequence")
references = Sequence(Value("string", id="sequence"), id="references")

sari

sources = Value("string", id="sequence")
predictions = Value("string", id="sequence")
references = Sequence(Value("string", id="sequence"), id="references")

seqeval

predictions = Sequence(Value("string", id="label"), id="sequence")
references = Sequence(Value("string", id="label"), id="sequence")

N.B. both predictions and references are in IOB format

spearmanr

predictions = Value("float")
references = Value("float")

squad

predictions = {"id": Value("string"), "prediction_text": Value("string")}
"references": {
"id": Value("string"),
"answers": features.Sequence(
    {
        "text": Value("string"),
        "answer_start": Value("int32"),
    }
)
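
For concreteness, a sketch of these dict-shaped inputs in a compute() call (standard SQuAD format, using the evaluate loader):

import evaluate

squad = evaluate.load("squad")
predictions = [{"id": "56e10a3be3433e1400422b22", "prediction_text": "1976"}]
references = [{"id": "56e10a3be3433e1400422b22",
               "answers": {"text": ["1976"], "answer_start": [97]}}]
print(squad.compute(predictions=predictions, references=references))
# {'exact_match': 100.0, 'f1': 100.0}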

squad_v2

"predictions": {
    "id": Value("string"),
    "prediction_text": Value("string"),
    "no_answer_probability": Value("float32"),
}
"references": {
    "id": Value("string"),
    "answers": features.Sequence(
      {"text": Value("string"), "answer_start": Value("int32")}
                        ),
                    }

N.B. The SQuAD and SQuAD v2 formats differ only in that v2 adds a 'no_answer_probability' field to the predictions.

super_glue

if self.config_name == "record":
    return {
        "predictions": {
            "idx": {
                "passage": Value("int64"),
                "query": Value("int64"),
            },
            "prediction_text": Value("string"),
        },
        "references": {
            "idx": {
                "passage": Value("int64"),
                "query": Value("int64"),
            },
            "answers": Sequence(Value("string")),
        },
    }
elif self.config_name == "multirc":
    return {
        "predictions": {
            "idx": {
                "answer": Value("int64"),
                "paragraph": Value("int64"),
                "question": Value("int64"),
            },
            "prediction": Value("int64"),
        },
        "references": Value("int64"),
    }
else:
    return {
        "predictions": Value("int64"),
        "references": Value("int64"),
    }
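
As a sketch of the "record" branch above, predictions and references are passed as dicts with a nested idx:

import evaluate

record = evaluate.load("super_glue", "record")
predictions = [{"idx": {"passage": 0, "query": 0}, "prediction_text": "answer"}]
references = [{"idx": {"passage": 0, "query": 0}, "answers": ["answer", "another answer"]}]
print(record.compute(predictions=predictions, references=references))  # exact_match / f1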

ter

predictions = Value("string", id="sequence")
references = Sequence(Value("string", id="sequence"), id="references")

wer

predictions = Value("string", id="sequence")
references = Value("string", id="sequence")

wiki_split

predictions = Value("string", id="sequence")
references = Sequence(Value("string", id="sequence"), id="references")

xnli

predictions = Value("int64" if self.config_name != "sts-b" else "float32")
references = Value("int64" if self.config_name != "sts-b" else "float32")

xtreme_s

pred_type = "int64" if self.config_name in ["fleurs-lang_id", "minds14"] else "string"
predictions = Value(pred_type)
references = Value(pred_type)

N.B. the input depends on the XTREME-S dataset selected

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 19 (19 by maintainers)

Top GitHub Comments

1 reaction
lvwerra commented, May 5, 2022

@lhoestq indeed the data is cast there; however, when that cast is deactivated, the data is still cast later on by pyarrow.array (see docs). The main issue is string types, since everything can be cast to a string.

Ideally we would have a mechanism that checks all types, but this could be quite extensive. So maybe to start we could just check whether something that should be a string really is a string. What do you think?
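
Purely as an illustration of that idea (a hypothetical helper, not part of the library), such a check could unwrap the declared feature types and verify that string-typed columns really contain strings before any casting happens:

from datasets import Features, Sequence, Value

def check_string_features(features: Features, batch: dict):
    """Hypothetical sketch: raise if a column typed as string holds non-strings."""
    for name, feature in features.items():
        # Unwrap one level of Sequence to get the element type.
        element = feature.feature if isinstance(feature, Sequence) else feature
        if isinstance(element, Value) and element.dtype == "string":
            values = batch[name]
            # Flatten one level of nesting so both Value and Sequence(Value) columns are covered.
            flat = [v for item in values for v in (item if isinstance(item, list) else [item])]
            if not all(isinstance(v, str) for v in flat):
                raise TypeError(f"Column '{name}' is typed as string but contains non-string values")

features = Features({"predictions": Value("string"), "references": Sequence(Value("string"))})
check_string_features(features, {"predictions": ["a cat"], "references": [["a cat", "the cat"]]})  # passes
check_string_features(features, {"predictions": [0], "references": [["a cat"]]})  # raises TypeError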

1 reaction
lhoestq commented, May 4, 2022

Indeed currently it tries to cast the type here:

https://github.com/huggingface/evaluate/blob/df3d20712df202b586f73cf45a66b65652e45d5b/src/evaluate/metric.py#L464

You can try removing this line, it should fix this issue 😃
