Evaluation/Metrics approach
This has been discussed a bit in other issues, but I wanted to make a dedicated issue for us to discuss this, as I think it's very important we get this right.
Background
Some previous discussions here and here.
Throughout, I am going to use the motivating example of training a supervised model where you periodically want to compute some metrics against a validation set.
Current Setup
Currently, in order to accomplish this, you need to do the following:
- create an Evaluator
- register the Evaluate handler to run the Evaluator on the validation set and store the predictions in the history
- add another event handler to actually use this history to compute the metrics you care about
- log/plot these metrics however you choose
In code, this looks something like this:
model = ...
validation_loader = ...
trainer = ...
evaluator = create_supervised_evaluator(model, cuda=True)
trainer.add_event_handler(Events.EPOCH_COMPLETED, Evaluate(evaluator, validation_loader, epoch_interval=1))
@trainer.on(Events.EPOCH_COMPLETED)
def log(engine):
    print(engine.current_epoch, categorical_accuracy(evaluator.history))
Pros
- keeps library code cleanly separated with minimal implicit dependencies
- user doesn’t have to write much code
Cons
- can be confusing what happens where
- we have both an Evaluator and an Evaluate, and yet neither one computes any sort of metrics
- there are a lot of ways this could go wrong:
  - you have to understand the contract between what gets stored in the Evaluator's history and the metrics functions (see the sketch after this list)
  - you have to make sure you attach the Evaluate handler and any logging handlers to the same event
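To make that contract concrete, here is a rough sketch of the kind of metric function this setup expects. I'm assuming the history holds (y_pred, y) batch pairs, which may not match the actual stored format exactly:

```python
def categorical_accuracy(history):
    # Assumed contract: history is a list of (y_pred, y) batch pairs
    # collected by the Evaluate handler, with y_pred of shape (N, C).
    correct, total = 0, 0
    for y_pred, y in history:
        correct += (y_pred.argmax(dim=1) == y).sum().item()
        total += y.shape[0]
    return correct / total
```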
Goals
Evaluating a model is going to be something that (essentially) everyone does, so I think we need to have a good story here. IMO, we should make the supervised case super easy while still making non-supervised cases possible. That said, I think we want to accomplish this without removing flexibility and without adding a ton of code.
Ideas
Working backward from what I would like the API to be, it might be nice if you could just do something like this:
model = ...
validation_loader = ...
trainer = ...
@trainer.on(Events.EPOCH_COMPLETED)
def run_validation(engine):  # renamed so the handler doesn't shadow the proposed evaluate() helper
    results = evaluate(model, {'acc': categorical_accuracy})
    # do something with those results
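For what it's worth, a minimal sketch of what that evaluate() helper could look like, assuming it also takes the validation loader (the name and signature here are just my guess, not an existing API):

```python
import torch

def evaluate(model, data_loader, metrics):
    # Hypothetical helper: run the model over the loader once, collect
    # (y_pred, y) pairs, and return {metric_name: value}.
    model.eval()
    history = []
    with torch.no_grad():
        for x, y in data_loader:
            history.append((model(x), y))
    return {name: fn(history) for name, fn in metrics.items()}
```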
It’d be even nicer if I could do something like this:
model = ...
validation_loader = ...
trainer = ...
trainer.add_event_handler(Events.EPOCH_COMPLETED, Evaluate(model, {'acc': categorical_accuracy}))
But without making assumptions about how users want to plot/log their evaluation results, this isn’t possible.
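One way to partially sidestep that, purely as a sketch building on the evaluate() helper above (again, not an existing API), is to have the handler take a user-supplied callback, so the library never decides how results get logged or plotted:

```python
class Evaluate:
    # Hypothetical handler: compute the metrics, then hand the results
    # to a user-supplied callback instead of logging/plotting them itself.
    def __init__(self, model, data_loader, metrics, on_results):
        self.model = model
        self.data_loader = data_loader
        self.metrics = metrics
        self.on_results = on_results

    def __call__(self, engine):
        results = evaluate(self.model, self.data_loader, self.metrics)
        self.on_results(engine, results)

trainer.add_event_handler(
    Events.EPOCH_COMPLETED,
    Evaluate(model, validation_loader, {'acc': categorical_accuracy},
             on_results=lambda engine, results: print(engine.current_epoch, results)))
```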
What do you all think? Anything here you take issue with? Any ideas on how we can best accomplish this? Do we need to make plotting/logging part of this discussion as well?
Top GitHub Comments
So I came up with something pretty similar to @jasonkriss’s:
The main bit being:
where the Evaluator takes in a few metrics and whether the mean should be computed (probably better to expand this to some form of generic reduction, but that also makes the interface messy).

To implement this, the create_evaluator function would create an Evaluator with an evaluation/inference function that passes the batch through the model and then through each of the metrics, and then stores these results to history. The reduction operation could then happen after the data is consumed.
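Purely as an illustration of that idea (not the actual ignite implementation; the names here are made up):

```python
import torch

def create_evaluator(model, metrics):
    # The inference function forwards a batch, applies each metric to the
    # single (prediction, ground_truth) pair, and appends the per-batch
    # results to a history; a reduction (e.g. the mean) can run afterwards.
    history = []

    def _inference(batch):
        model.eval()
        with torch.no_grad():
            x, y = batch
            y_pred = model(x)
        history.append({name: fn(y_pred, y) for name, fn in metrics.items()})

    def run(data_loader):
        history.clear()
        for batch in data_loader:
            _inference(batch)
        return history

    return run
```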
This leads to a few questions/decisions:
- The interface to ignite.metrics would have to support a single (prediction, ground_truth) pair. It should probably also support doing this over a list of pairs (for people who want rolling-window metrics during training).
- If we keep the Evaluator object as an instance of Engine, should the object take care of the reduction operations, or should this be done via handlers (which we can add to the evaluator in the factory functions)? The passing of data through metrics and storing to history must be done in the inference/evaluation function and takes a different form for supervised, unsupervised, and multi-task learning, so I don't think we can make this general (we can, however, provide factory functions).

Hey @henrique, yes, metrics history is no longer stored in the engine's state. I would say the best way to keep it available is to log it: print it, send it to tensorboardX, write it to a file, or keep it in any long-lived variable you can plot from. Take a look at our notebook examples:
HTH
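For reference, the pattern from those examples looks roughly like this, reusing the model/trainer/validation_loader from the earlier snippets (the exact API may differ between ignite versions):

```python
from ignite.engine import Events, create_supervised_evaluator
from ignite.metrics import Accuracy

# Attach metrics to an evaluator and log them from a trainer handler.
evaluator = create_supervised_evaluator(model, metrics={'accuracy': Accuracy()})

@trainer.on(Events.EPOCH_COMPLETED)
def log_validation_results(engine):
    evaluator.run(validation_loader)
    metrics = evaluator.state.metrics
    print(f"epoch {engine.state.epoch}  val acc: {metrics['accuracy']:.4f}")
    # from here: write to TensorBoard, a file, or keep in a list for plotting
```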