Evaluation/Metrics approach
This has been discussed a bit in other issues, but I wanted to make a dedicated issue for us to discuss this, as I think it's very important we get this right.
Background
Some previous discussions here and here.
Throughout, I am going to use the motivating example of training a supervised model where you periodically want to compute some metrics against a validation set.
Current Setup
Currently, in order to accomplish this, you need to do the following:
- create an Evaluator
- register the Evaluate handler to run the Evaluator on the validation set and store the predictions in the history
- add another event handler to actually use this history to compute the metrics you care about
- log/plot these metrics however you choose
In code, this looks something like this:
model = ...
validation_loader = ...
trainer = ...
evaluator = create_supervised_evaluator(model, cuda=True)
trainer.add_event_handler(Events.EPOCH_COMPLETED, Evaluate(evaluator, validation_loader, epoch_interval=1))
@trainer.on(Events.EPOCH_COMPLETED)
def log(engine):
    print(engine.current_epoch, categorical_accuracy(evaluator.history))
Pros
- keeps library code cleanly separated with minimal implicit dependencies
- user doesn’t have to write much code
Cons
- can be confusing what happens where
- we have both an Evaluator and an Evaluate, and yet neither one computes any sort of metrics
- there are a lot of ways this could go wrong:
  - you have to understand the contract between what gets stored in the Evaluator's history and the metrics functions (see the sketch after this list)
  - you have to make sure you attach the Evaluate handler and any logging handlers to the same event
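To make that contract concrete, here is a rough sketch of the kind of metric function this setup expects. I'm assuming the history holds (y_pred, y) batch pairs, which may not match the actual stored format exactly:

```python
def categorical_accuracy(history):
    # Assumed contract: history is a list of (y_pred, y) batch pairs
    # collected by the Evaluate handler, with y_pred of shape (N, C).
    correct, total = 0, 0
    for y_pred, y in history:
        correct += (y_pred.argmax(dim=1) == y).sum().item()
        total += y.shape[0]
    return correct / total
```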
Goals
Evaluating a model is going to be something that (essentially) everyone does, so I think we need to have a good story here. IMO, we should make the supervised case super easy while still making non-supervised cases possible. That said, I think we want to accomplish this without removing flexibility and without adding a ton of code.
Ideas
Working backward from what I would like the API to be, it might be nice if you could just do something like this:
model = ...
validation_loader = ...
trainer = ...
@trainer.on(Events.EPOCH_COMPLETED)
def run_validation(engine):  # renamed so the handler doesn't shadow the proposed evaluate() helper
    results = evaluate(model, {'acc': categorical_accuracy})
    # do something with those results
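For what it's worth, a minimal sketch of what that evaluate() helper could look like, assuming it also takes the validation loader (the name and signature here are just my guess, not an existing API):

```python
import torch

def evaluate(model, data_loader, metrics):
    # Hypothetical helper: run the model over the loader once, collect
    # (y_pred, y) pairs, and return {metric_name: value}.
    model.eval()
    history = []
    with torch.no_grad():
        for x, y in data_loader:
            history.append((model(x), y))
    return {name: fn(history) for name, fn in metrics.items()}
```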
It’d be even nicer if I could do something like this:
model = ...
validation_loader = ...
trainer = ...
trainer.add_event_handler(Events.EPOCH_COMPLETED, Evaluate(model, {'acc': categorical_accuracy}))
But without making assumptions about how users want to plot/log their evaluation results, this isn’t possible.
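One way to partially sidestep that, purely as a sketch building on the evaluate() helper above (again, not an existing API), is to have the handler take a user-supplied callback, so the library never decides how results get logged or plotted:

```python
class Evaluate:
    # Hypothetical handler: compute the metrics, then hand the results
    # to a user-supplied callback instead of logging/plotting them itself.
    def __init__(self, model, data_loader, metrics, on_results):
        self.model = model
        self.data_loader = data_loader
        self.metrics = metrics
        self.on_results = on_results

    def __call__(self, engine):
        results = evaluate(self.model, self.data_loader, self.metrics)
        self.on_results(engine, results)

trainer.add_event_handler(
    Events.EPOCH_COMPLETED,
    Evaluate(model, validation_loader, {'acc': categorical_accuracy},
             on_results=lambda engine, results: print(engine.current_epoch, results)))
```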
What do you all think? Anything here you take issue with? Any ideas on how we can best accomplish this? Do we need to make plotting/logging part of this discussion as well?
Top GitHub Comments
So I came up with something pretty similar to @jasonkriss’s:
The main bit being:
where the Evaluator takes in a few metrics and whether the mean should be computed (probably better to expand this to some form of generic reduction, but that also makes the interface messy).

To implement this, the create_evaluator function would create an Evaluator with an evaluation/inference function that passes the batch through the model and then through each of the metrics, and then stores these results to history. The reduction operation could then happen after the data is consumed.
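Purely as an illustration of that idea (not the actual ignite implementation; the names here are made up):

```python
import torch

def create_evaluator(model, metrics):
    # The inference function forwards a batch, applies each metric to the
    # single (prediction, ground_truth) pair, and appends the per-batch
    # results to a history; a reduction (e.g. the mean) can run afterwards.
    history = []

    def _inference(batch):
        model.eval()
        with torch.no_grad():
            x, y = batch
            y_pred = model(x)
        history.append({name: fn(y_pred, y) for name, fn in metrics.items()})

    def run(data_loader):
        history.clear()
        for batch in data_loader:
            _inference(batch)
        return history

    return run
```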
This leads to a few questions/decisions:
- The interface to ignite.metrics would have to support a single (prediction, ground_truth) pair. It should probably also support doing this over a list of pairs (for people who want rolling-window metrics during training).
- If we keep the Evaluator object as an instance of Engine, should the object take care of the reduction operations, or should this be done via handlers (which we can add to the evaluator in the factory functions)? The passing of data through metrics and storing to history must be done in the inference/evaluation function and takes a different form for supervised, unsupervised, and multi-task learning, so I don't think we can make this general (we can, however, provide factory functions).

Hey @henrique, yes, metrics history is no longer stored in the engine's state. I would say the best way to keep it available is to log it: print it, send it to tensorboardX, write it to a file, or keep it in any long-lived variable you can plot from. Take a look at our notebook examples:
HTH
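For reference, the pattern from those examples looks roughly like this, reusing the model/trainer/validation_loader from the earlier snippets (the exact API may differ between ignite versions):

```python
from ignite.engine import Events, create_supervised_evaluator
from ignite.metrics import Accuracy

# Attach metrics to an evaluator and log them from a trainer handler.
evaluator = create_supervised_evaluator(model, metrics={'accuracy': Accuracy()})

@trainer.on(Events.EPOCH_COMPLETED)
def log_validation_results(engine):
    evaluator.run(validation_loader)
    metrics = evaluator.state.metrics
    print(f"epoch {engine.state.epoch}  val acc: {metrics['accuracy']:.4f}")
    # from here: write to TensorBoard, a file, or keep in a list for plotting
```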