Add `Evaluator` class to easily evaluate a combination of (model, dataset, metric)


Similar to the `Trainer` class in transformers, it would be nice to easily evaluate a model on a dataset given a metric. We could use the `Trainer` itself, but it comes with a lot of unused extra machinery and is transformers-centric. Alternatively, we could build an `Evaluator` along these lines:

from evaluate import Evaluator
from evaluate import load_metric
from datasets import load_dataset
from transformers import pipeline

metric = load_metric("bleu")
dataset = load_dataset("wmt19", "de-en")
pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")

# WMT-specific transform: flatten the nested translation dict into source/target columns
dataset = dataset.map(lambda x: {"source": x["translation"]["de"], "target": x["translation"]["en"]}) 

evaluator = Evaluator(
    model=pipe,
    dataset=dataset,
    metric=metric,
    dataset_mapping={"model_input": "source", "references": "target"}
)

evaluator.evaluate()
>>> {"bleu": 12.4}

The `dataset_mapping` maps the dataset columns to the inputs expected by the model and the metric. With the pipeline API as the standard interface for the `Evaluator`, this could easily be extended to any other framework: the user would just need to set up a pipeline class whose inputs and outputs follow the same format and which implements a `__call__` method, as sketched below.
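
For illustration, here is a minimal sketch of what such a pipeline-compatible wrapper could look like for a non-transformers model (the class and its `translate_batch` method are hypothetical, not an existing API):

from typing import Dict, List

class CustomTranslationPipeline:
    """Pipeline-style wrapper around an arbitrary translation model.

    The only contract is: the object is callable, takes raw inputs, and returns
    outputs in the same format as the transformers translation pipeline."""

    def __init__(self, model):
        # `model` is any object exposing a (hypothetical) translate_batch method
        self.model = model

    def __call__(self, inputs: List[str], **kwargs) -> List[Dict[str, str]]:
        translations = self.model.translate_batch(inputs)
        return [{"translation_text": t} for t in translations]

As long as the wrapper is callable and produces the same output format, the `Evaluator` would not need to know which framework sits underneath.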

The advantage of starting with the pipeline API is that in transformers it already implements a lot of quality-of-life functionality, such as batching and GPU placement, and it abstracts away the pre- and post-processing.
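
As a sketch of what comes for free here (the arguments below are existing transformers pipeline options, not part of the proposed `Evaluator` API):

from transformers import pipeline

pipe = pipeline(
    "translation",
    model="Helsinki-NLP/opus-mt-de-en",
    device=0,      # run on the first GPU; -1 (the default) keeps it on CPU
    batch_size=8,  # let the pipeline batch inputs internally
)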

In #16 it is mentioned that statistical significance testing would be a desired feature. The above example could be extended to enable this:

evaluator.evaluate(n_runs=42)
>>> [{"bleu": 12.4}, {"bleu": 8.3}, ...]

Under the hood, the random seed would be changed between runs.
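
One possible implementation, shown only as a sketch (the bootstrap-resampling strategy and the helper below are assumptions, not part of the proposal), is to resample the evaluation set with a different seed per run and recompute the metric:

import random
from typing import Callable, Dict, List

def evaluate_n_runs(evaluate_fn: Callable[[list], Dict], examples: list,
                    n_runs: int = 42, seed: int = 0) -> List[Dict]:
    """Hypothetical helper: evaluate on bootstrap resamples, changing the seed per run."""
    results = []
    for run in range(n_runs):
        rng = random.Random(seed + run)
        # sample with replacement to get a bootstrap resample of the eval set
        resample = [examples[rng.randrange(len(examples))] for _ in range(len(examples))]
        results.append(evaluate_fn(resample))
    return results

The spread of scores across runs could then feed confidence intervals or significance tests.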

cc @douwekiela @osanseviero @nrajani @lhoestq

Issue Analytics

  • State: closed
  • Created: a year ago
  • Reactions: 2
  • Comments: 12 (12 by maintainers)

Top GitHub Comments

1 reaction
fxmarty commented, May 20, 2022

Hello, I would be a user of a similar feature in https://github.com/huggingface/autoquantize (an API to launch evaluation of Optimum quantized models vs. the transformers baseline). For now I have written my own evaluation scripts using pipelines.

Just wanted to point out that using pipelines for evaluation does not work out of the box; see for example https://github.com/huggingface/transformers/issues/17305 and https://github.com/huggingface/transformers/issues/17139. At least I haven’t found a way to make it work in a task-independent way. I would be glad to discuss this if somebody is working on it.
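
For context, a minimal sketch of the kind of pipeline-based evaluation loop being discussed (the dataset, model, and column name below are placeholders, not taken from the linked issues):

from datasets import load_dataset
from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset

dataset = load_dataset("imdb", split="test")
pipe = pipeline("text-classification",
                model="distilbert-base-uncased-finetuned-sst-2-english")

predictions = []
# KeyDataset streams a single dataset column into the pipeline, which then
# handles batching and pre/post-processing internally.
for output in pipe(KeyDataset(dataset, "text"), batch_size=8):
    predictions.append(output["label"])

The task-dependent part is mapping the pipeline outputs back to metric inputs (label strings vs. ids, generated text vs. scores), which is what makes a fully generic loop hard.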

My approach is the following: https://github.com/fxmarty/optimum/tree/runs-only/optimum/utils/preprocessing

See as well https://github.com/huggingface/optimum/pull/194 .

ping @mfuntowicz as well

1 reaction
lewtun commented, May 16, 2022

Good point about custom datasets and I like your idea about explicitly showing the expected inputs / outputs in the repr!

For Hub datasets, you don’t have to download any files as you can ping the datasets API directly, e.g.

import requests
from typing import Dict, Union

def get_metadata(dataset_name: str) -> Union[Dict, None]:
    # query the Hub API for the dataset card metadata
    data = requests.get(f"https://huggingface.co/api/datasets/{dataset_name}").json()
    card_data = data.get("cardData")
    if card_data is not None and "train-eval-index" in card_data:
        return card_data["train-eval-index"]
    return None

metadata = get_metadata("imdb")

The only “problem” is that we’ve defined our column mappings to align with AutoTrain, and that taxonomy might not be as convenient / flexible for what you’re trying to do.
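
As a rough sketch of how that metadata might be bridged to the proposed `dataset_mapping` (the `col_mapping` field and its orientation are assumptions about the `train-eval-index` schema, and the adapter itself is hypothetical):

def to_dataset_mapping(metadata: list) -> dict:
    """Hypothetical adapter from train-eval-index metadata to the proposed dataset_mapping."""
    if not metadata:
        return {}
    # col_mapping maps dataset columns to task columns; the Evaluator wants the
    # reverse orientation (role -> dataset column), so flip it.
    col_mapping = metadata[0].get("col_mapping", {})
    return {task_col: dataset_col for dataset_col, task_col in col_mapping.items()}

metadata = get_metadata("imdb")
dataset_mapping = to_dataset_mapping(metadata) if metadata is not None else {}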
