
Evaluation/Metrics approach


This has been discussed a bit in other issues, but I wanted to make a dedicated issue for us to discuss this, as I think it's very important that we get this right.

Background

Some previous discussions here and here.

Throughout, I am going to use the motivating example of training a supervised model where you periodically want to compute some metrics against a validation set.

Current Setup

Currently, in order to accomplish this, you need to do the following:

  1. create an Evaluator
  2. register the Evaluate handler to run the Evaluator on the validation set and store the predictions in the history
  3. add another event handler to actually use this history to compute the metrics you care about
  4. log/plot these metrics however you choose

In code, this looks something like this:

model = ...
validation_loader = ...
trainer = ...

# 1. create an Evaluator
evaluator = create_supervised_evaluator(model, cuda=True)

# 2. run the Evaluator on the validation set every epoch; predictions go to its history
trainer.add_event_handler(Events.EPOCH_COMPLETED, Evaluate(evaluator, validation_loader, epoch_interval=1))

# 3./4. compute the metric from the stored history and log it
@trainer.on(Events.EPOCH_COMPLETED)
def log(engine):
    print(engine.current_epoch, categorical_accuracy(evaluator.history))

Pros

  • keeps library code cleanly separated with minimal implicit dependencies
  • user doesn’t have to write much code

Cons

  • it can be confusing what happens where
    • we have both an Evaluator and an Evaluate and yet neither one computes any sort of metrics
  • there are a lot of ways this could go wrong
    • you have to understand the contract between what gets stored in the Evaluator’s history and the metrics functions
    • you have to make sure you attach the Evaluate handler and any logging handlers to the same event

Goals

Evaluating a model is something (essentially) everyone does, so I think we need to have a good story here. In my opinion, we should make the supervised case very easy while still supporting non-supervised cases. That said, I think we want to accomplish this without removing flexibility and without adding a ton of code.

Ideas

Working backward from what I would like the API to be, it might be nice if you could just do something like this:

model = ...
validation_loader = ...
trainer = ...

@trainer.on(Events.EPOCH_COMPLETED)
def run_validation(engine):
    results = evaluate(model, validation_loader, {'acc': categorical_accuracy})
    # do something with those results
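
For illustration, the evaluate helper above could be little more than a loop over the loader; its name and signature here are just placeholders for the discussion, not an existing ignite function:

import torch

# hypothetical helper, not part of ignite: run the model over a loader and
# apply each metric to the accumulated predictions and targets
def evaluate(model, data_loader, metrics):
    model.eval()
    predictions, targets = [], []
    with torch.no_grad():
        for x, y in data_loader:
            predictions.append(model(x))
            targets.append(y)
    y_pred = torch.cat(predictions)
    y = torch.cat(targets)
    # metrics: dict mapping a name to a callable taking (y_pred, y)
    return {name: metric(y_pred, y) for name, metric in metrics.items()}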

It’d be even nicer if I could do something like this:

model = ...
validation_loader = ...
trainer = ...
trainer.add_event_handler(Events.EPOCH_COMPLETED, Evaluate(model, {'acc': categorical_accuracy}))

But without making assumptions about how users want to plot/log their evaluation results, this isn’t possible.
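
The closest I can see (just a sketch, nothing like this exists in the library) is to hand the logging decision back to the user through a callback, which is arguably just the decorator version in disguise:

# hypothetical handler: run evaluation and pass the results to user code,
# so the library never decides how they are logged or plotted
class Evaluate:
    def __init__(self, model, data_loader, metrics, output_handler):
        self.model = model
        self.data_loader = data_loader
        self.metrics = metrics
        self.output_handler = output_handler

    def __call__(self, engine):
        results = evaluate(self.model, self.data_loader, self.metrics)
        self.output_handler(engine, results)

trainer.add_event_handler(
    Events.EPOCH_COMPLETED,
    Evaluate(model, validation_loader, {'acc': categorical_accuracy},
             output_handler=lambda engine, results: print(engine.current_epoch, results)))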

What do you all think? Anything here you take issue with? Any ideas on how we can best accomplish this? Do we need to make plotting/logging part of this discussion as well?


Top GitHub Comments

3 reactions
alykhantejani commented, Feb 14, 2018

So I came up with something pretty similar to @jasonkriss’s:

def run(train_batch_size, val_batch_size, epochs, lr, momentum, log_interval, logger):
    train_loader, val_loader = get_data_loaders(train_batch_size, val_batch_size)
    model = Net()
    optimizer = SGD(model.parameters(), lr=lr, momentum=momentum)
    trainer = create_supervised_trainer(model, optimizer, nn.NLLLoss())
    # proposed factory: a list of metrics plus whether each one should be averaged
    evaluator = create_evaluator(model, metrics=[categorical_accuracy, nn.NLLLoss()],
                                 mean=[True, True])

    @trainer.on(Events.ITERATION_COMPLETED)
    def log_training_loss(trainer):
        if trainer.current_iteration % log_interval == 0:
            log_simple_moving_average(trainer, window_size=10, logger=logger)

    @trainer.on(Events.EPOCH_COMPLETED)
    def log_validation_results(trainer):
        avg_accuracy, avg_loss = evaluator.run(val_loader)
        logger("Validation Results - Epoch: {}  Avg accuracy: {:.2f} Avg loss: {:.2f}"
               .format(trainer.current_epoch, avg_accuracy, avg_loss))

    trainer.run(train_loader, max_epochs=epochs)

The main bit being:

evaluator = create_evaluator(model, metrics=[categorical_accuracy, nn.NLLLoss()],
                             mean=[True, True])
avg_accuracy, avg_loss = evaluator.run(val_loader)

where the Evaluator takes in a few metrics and whether the mean should be computed for each (it would probably be better to expand this into some form of generic reduction, but that also makes the interface messier).

To implement this, the create_evaluator function would create an Evaluator with an evaluation/inference function that passes each batch through the model and then through each of the metrics, storing the results to history. The reduction operation could then happen after the data is consumed.
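
Roughly, and ignoring the Engine plumbing, the factory I have in mind could look something like this (all the names and the reduction behaviour are just placeholders for the discussion):

import torch

def create_evaluator(model, metrics, mean):
    # hypothetical factory: the inference step applies every metric to
    # (y_pred, y) per batch and stores the results; reduction happens
    # only after all the data has been consumed
    def _inference(batch, history):
        x, y = batch
        with torch.no_grad():
            y_pred = model(x)
        history.append([metric(y_pred, y) for metric in metrics])

    class _Evaluator(object):
        def run(self, data_loader):
            model.eval()
            history = []
            for batch in data_loader:
                _inference(batch, history)
            # one column of history per metric
            per_metric = list(zip(*history))
            return tuple(
                sum(values) / len(values) if reduce_mean else list(values)
                for values, reduce_mean in zip(per_metric, mean))

    return _Evaluator()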

This leads to a few questions/decisions:

  1. The interface to ignite.metrics would have to support a single (prediction, ground_truth) pair. It should probably also support a list of such pairs, for people who want rolling-window metrics during training (see the sketch after this list).

  2. If we keep the Evaluator object as an instance of Engine, should the object take care of the reduction operations, or should this be done via handlers (which we can add to the evaluator in the factory functions)? The passing of data through the metrics and the storing to history must be done in the inference/evaluation function, and it takes a different form for supervised, unsupervised, and multi-task learning, so I don't think we can make this general (we can, however, provide factory functions).
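
As a sketch of point 1 (the signature is only a suggestion), a metric could accept either a single (prediction, ground_truth) pair or a list of pairs:

import torch

def categorical_accuracy(y_pred, y=None):
    # hypothetical interface: accept a single (y_pred, y) pair, or a list of
    # (y_pred, y) pairs, e.g. a rolling window collected during training
    if y is None:
        pairs = y_pred
        y_pred = torch.cat([p for p, _ in pairs])
        y = torch.cat([t for _, t in pairs])
    correct = (torch.argmax(y_pred, dim=1) == y).float()
    return correct.mean().item()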

2 reactions
vfdev-5 commented, Jun 29, 2019

Hey @henrique, yes, metrics history is no longer stored in the engine's state. I would say the best way to keep it available is to log it: print it, send it to tensorboardX, write it to a file, or append it to any long-lived variable for plotting, etc. Take a look at our notebook examples (a minimal sketch follows the list below):

  • log to TensorboardX: EfficientNet_Cifar100_finetuning or CycleGAN
  • log to a variable to plot with matplotlib: FashionMNIST
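
As a minimal sketch of the variable-based approach (model, trainer and val_loader come from your own setup; the metric is just an example):

import matplotlib.pyplot as plt
from ignite.engine import Events, create_supervised_evaluator
from ignite.metrics import Accuracy

# long-lived variable that outlives the run, so it can be plotted afterwards
validation_history = {'epoch': [], 'accuracy': []}

evaluator = create_supervised_evaluator(model, metrics={'accuracy': Accuracy()})

@trainer.on(Events.EPOCH_COMPLETED)
def store_validation_accuracy(trainer):
    evaluator.run(val_loader)
    validation_history['epoch'].append(trainer.state.epoch)
    validation_history['accuracy'].append(evaluator.state.metrics['accuracy'])

# after trainer.run(...) finishes
plt.plot(validation_history['epoch'], validation_history['accuracy'])
plt.xlabel('epoch')
plt.ylabel('validation accuracy')
plt.show()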

HTH
