better checking of data returned from training_step
🚀 Feature
let's add more validation checks on what's returned from training_step and provide the user with useful error messages when they're not returning the right values.
Motivation
i feel like i've seen a lot of users confused about what they're supposed to return in training_step and validation_step. additionally, i don't think we document how we treat extra keys as "callback metrics" very well.
Pitch
what do you think about adding some structure and validation for Trainer's process_output method?
right now, we have expectations about a set of keys {progress_bar, log, loss, hiddens} and assume everything else is a callback metric. however, this is a silent assumption.
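To make the silent assumption concrete, here is a minimal sketch of how extra keys end up as callback metrics today (illustrative only; the real process_output logic is more involved, and split_callback_metrics is a hypothetical helper name):

```python
# Any key outside the reserved set silently becomes a callback metric,
# with no warning or validation. (Sketch, not the actual Trainer code.)
RESERVED_KEYS = {'progress_bar', 'log', 'loss', 'hiddens'}

def split_callback_metrics(output):
    """Return the keys that would silently be treated as callback metrics."""
    return {k: v for k, v in output.items() if k not in RESERVED_KEYS}

output = {'loss': 0.3, 'log': {'lr': 1e-3}, 'val_acc': 0.9}
print(split_callback_metrics(output))  # {'val_acc': 0.9}
```

A user who typos a reserved key (e.g. 'los') gets no error at all; the value just disappears into the callback metrics.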
we could instead enforce a more rigid structure:
{
    'loss': loss,            # REQUIRED
    'log': {},               # optional dict
    'progress_bar': {},      # optional dict
    'hiddens': [h0, c0],     # optional collection of tensors
    'metrics': {},           # optional dict
}
moreover, we can leverage pydantic to do the validation automatically and provide useful error messages out of the box when data validation fails.
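pydantic would let us declare the schema once and get descriptive errors for free; as a dependency-free illustration of the checks it would perform, here is a hand-rolled sketch (the function name and error messages are hypothetical, not existing Lightning API):

```python
ALLOWED_KEYS = {'loss', 'log', 'progress_bar', 'hiddens', 'metrics'}

def validate_step_output(output):
    """Sketch of the validation pydantic would give us automatically."""
    if not isinstance(output, dict):
        raise TypeError(
            f"training_step must return a dict, got {type(output).__name__}")
    if 'loss' not in output:
        raise KeyError("training_step output is missing the required 'loss' key")
    unexpected = set(output) - ALLOWED_KEYS
    if unexpected:
        raise KeyError(
            f"unexpected keys in training_step output: {sorted(unexpected)}")
    for key in ('log', 'progress_bar', 'metrics'):
        if key in output and not isinstance(output[key], dict):
            raise TypeError(
                f"'{key}' must be a dict, got {type(output[key]).__name__}")
    return output
```

Note the explicit rejection of unexpected keys: this is the opposite of today's behavior, where they silently become callback metrics, so it is exactly the backwards-incompatible part of the proposal.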
cc @PyTorchLightning/core-contributors
Alternatives
Do nothing, keep things as they are.
Additional context
This would be a backwards incompatible change.
@Borda given that this proposal is backwards incompatible, i think we should get more core contributors to weigh in on the proposed design before moving forward and implementing it.
one thing that is still giving me tension is the fact that there's a lot of overlap between log, progress_bar, and metrics. progress_bar almost always consists of a subset of log, and metrics (or as they currently stand, arbitrary keys) are typically used to store temporary values to be collated and logged at the end of an epoch. i think there's room for improvement here.

Shouldn't we favor the return type to be a strong type? I've always wondered why the step return type is not a dataclass or named tuple where loss is a required argument. We could keep the flexibility using some metadata dict argument.
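A dataclass version of that idea could look like the sketch below (the class name, field set, and the flexible metadata dict are all assumptions for illustration, not an agreed design):

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

@dataclass
class StepResult:
    """Hypothetical strongly-typed return value for training_step."""
    loss: Any                                           # required; a torch.Tensor in practice
    log: Dict[str, Any] = field(default_factory=dict)
    progress_bar: Dict[str, Any] = field(default_factory=dict)
    hiddens: Optional[List[Any]] = None
    metadata: Dict[str, Any] = field(default_factory=dict)  # keeps the flexibility

# forgetting loss now fails loudly at construction time with a TypeError,
# instead of silently misbehaving later in process_output
result = StepResult(loss=0.5, log={'train_loss': 0.5})
```

One nice property of this over a plain dict: typos in field names raise immediately (`StepResult(los=0.5)` is a TypeError), which addresses the silent-assumption problem without any custom validation code.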