Add `Evaluator` class to easily evaluate a combination of (model, dataset, metric)
Similar to the `Trainer` class in `transformers`, it would be nice to easily evaluate a model on a dataset given a metric. We could use the `Trainer`, but it comes with a lot of unused extra stuff and is `transformers`-centric. Alternatively, we could build an `Evaluator` as follows:
```python
from evaluate import Evaluator
from evaluate import load_metric
from datasets import load_dataset
from transformers import pipeline

metric = load_metric("bleu")
dataset = load_dataset("wmt19", language_pair=("de", "en"))
pipe = pipeline("translation", model="opus-mt-de-en")

# WMT specific transform
dataset = dataset.map(lambda x: {"source": x["translation"]["de"], "target": x["translation"]["en"]})

evaluator = Evaluator(
    model=pipe,
    dataset=dataset,
    metric=metric,
    dataset_mapping={"model_input": "source", "references": "target"}
)

evaluator.evaluate()
>>> {"bleu": 12.4}
```
The `dataset_mapping` maps the dataset columns to inputs for the model and metric. Using the `pipeline` API as the standard for the `Evaluator`, this could easily be extended to any other framework: the user would just need to set up a pipeline class whose inputs and outputs follow the same format and which implements a `__call__` method.
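For illustration, a minimal sketch of what such a pipeline-like wrapper could look like for a non-`transformers` framework. The class name and the `model.translate` / `tokenizer` internals are hypothetical; only the call signature and output format mirror the `translation` pipeline task:

```python
class CustomTranslationPipeline:
    """Hypothetical wrapper: any framework can participate by mimicking the pipeline contract."""

    def __init__(self, model, tokenizer):
        self.model = model          # e.g. a fairseq, ONNX, or hand-rolled model (assumption)
        self.tokenizer = tokenizer  # matching pre-processing (assumption)

    def __call__(self, inputs):
        # Accept a single string or a list of strings, like a transformers pipeline.
        if isinstance(inputs, str):
            inputs = [inputs]
        # Hypothetical model call; the only hard requirement is the output format below.
        translations = [self.model.translate(self.tokenizer(text)) for text in inputs]
        return [{"translation_text": t} for t in translations]
```

With such a contract, the `Evaluator` would only ever call `pipe(model_inputs)` and read the returned dictionaries, so the underlying framework would not matter.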
The advantage of starting with the `pipeline` API is that in `transformers` it already implements a lot of quality-of-life functionality such as batching and GPU support. It also abstracts away the pre- and post-processing.
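For reference, the batching and GPU support mentioned above are already exposed through pipeline arguments (a minimal example; the exact kwargs may vary between `transformers` versions):

```python
from transformers import pipeline

# Place the model on GPU 0 and let the pipeline batch inputs internally.
pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en", device=0)
outputs = pipe(["Ein Satz.", "Noch ein Satz."], batch_size=8)
```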
In #16 it is mentioned that statistical significance testing would be a desired feature. The above example could be extended to enable this:
```python
evaluator.evaluate(n_runs=42)
>>> [{"bleu": 12.4}, {"bleu": 8.3}, ...]
```
Under the hood, the random seed would be changed between runs.
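A minimal sketch of what the seed handling behind `n_runs` could look like (illustrative only, not an existing API; it assumes the model has some stochastic component such as sampling or dropout):

```python
import random

import numpy as np
import torch


def evaluate_n_runs(evaluator, n_runs=42):
    """Illustrative sketch: repeat the evaluation with a different random seed per run."""
    results = []
    for seed in range(n_runs):
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        results.append(evaluator.evaluate())
    return results
```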
Top GitHub Comments
Hello, I would be a user of a similar feature in https://github.com/huggingface/autoquantize (an API to launch evaluation of Optimum quantized models vs. the transformers baseline). For the moment, I have written my own evaluation scripts using pipelines.
Just wanted to point out that using pipelines for evaluation does not work out of the box, see for example https://github.com/huggingface/transformers/issues/17305 and https://github.com/huggingface/transformers/issues/17139. At least I haven't found a way to make it work in a task-independent way. If somebody is working on this, I would be glad to discuss.
My approach is the following: https://github.com/fxmarty/optimum/tree/runs-only/optimum/utils/preprocessing
See as well https://github.com/huggingface/optimum/pull/194 .
ping @mfuntowicz as well
Good point about custom datasets, and I like your idea about explicitly showing the expected inputs / outputs in the `repr`!

For Hub datasets, you don't have to download any files as you can ping the `datasets` API directly, e.g. as sketched below. The only "problem" is that we've defined our column mappings to align with AutoTrain, and that taxonomy might not be as convenient / flexible for what you're trying to do.
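A minimal sketch of querying dataset metadata directly from the Hub without downloading any files (using `huggingface_hub`; whether this is the exact API the comment meant is an assumption):

```python
from huggingface_hub import HfApi

# Query dataset metadata (tags, card data, etc.) straight from the Hub,
# without downloading the data files themselves.
info = HfApi().dataset_info("wmt19")
print(info.id, info.tags)
```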