better checking of data returned from training_step

🚀 Feature

let’s add more validation checks on what’s returned from training_step and provide the user with useful error messages when they’re not returning the right values.

Motivation

i feel like i’ve seen a lot of users confused about what they’re supposed to return in training_step and validation_step. additionally, i don’t think we document how we treat extra keys as “callback metrics” very well.

Pitch

what do you think about adding some structure and validation for Trainer’s process_output method?

right now, we have expectations about a set of keys {progress_bar, log, loss, hiddens} and assume everything else is a callback metric. however, this is a silent assumption.
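roughly, the current handling amounts to something like the sketch below (a simplified illustration of the silent assumption, not the actual process_output implementation):

# simplified sketch: pops the known keys and treats whatever is left
# as callback metrics; not the real Trainer.process_output.
def process_output(output: dict):
    output = dict(output)                      # avoid mutating the caller's dict
    loss = output.pop('loss', None)
    log_metrics = output.pop('log', {})
    progress_bar = output.pop('progress_bar', {})
    hiddens = output.pop('hiddens', None)
    callback_metrics = output                  # every remaining key, silently
    return loss, log_metrics, progress_bar, hiddens, callback_metrics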

we could instead enforce a more rigid structure:

{
    'loss': loss,           # REQUIRED
    'log': {},              # optional dict
    'progress_bar': {},     # optional dict
    'hiddens': [h0, c0],    # optional collection of tensors
    'metrics': {},          # optional dict
}

moreover, we can leverage pydantic to do validation automatically and provide useful error messages out of the box when data validation fails.
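as a sketch of that idea (StepOutput is a hypothetical model name, not an existing Lightning API; assumes pydantic v1 is installed):

# sketch only: a pydantic model mirroring the structure proposed above.
from typing import Any, Dict, List, Optional

import torch
from pydantic import BaseModel, ValidationError

class StepOutput(BaseModel):
    loss: torch.Tensor                             # REQUIRED
    log: Dict[str, Any] = {}                       # optional dict
    progress_bar: Dict[str, Any] = {}              # optional dict
    hiddens: Optional[List[torch.Tensor]] = None   # optional collection of tensors
    metrics: Dict[str, Any] = {}                   # optional dict

    class Config:
        arbitrary_types_allowed = True             # needed for torch.Tensor fields

try:
    StepOutput(log={'acc': 0.9})                   # missing the required 'loss'
except ValidationError as err:
    print(err)                                     # names exactly which field failed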

cc @PyTorchLightning/core-contributors

Alternatives

Do nothing, keep things as they are.

Additional context

This would be a backwards incompatible change.

Top GitHub Comments

2 reactions
jeremyjordan commented, Mar 28, 2020

@Borda given that this proposal is backwards incompatible, i think we should get more core contributors to weigh in on the proposed design before moving forward and implementing it.

one thing that is still giving me tension is the fact that there’s a lot of overlap between log, progress_bar, and metrics. progress_bar almost always consists of a subset of log, and metrics (or as they currently stand, arbitrary keys) are typically used to store temporary values to be collated and logged at the end of an epoch. i think there’s room for improvement here.

1 reaction
gabisurita commented, Apr 30, 2020

Shouldn’t we favor making the return type a strong type? I’ve always wondered why the step return type is not a dataclass or named tuple where loss is a required argument. We could keep the flexibility using some metadata dict argument.
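For example, a minimal sketch of that idea (StepResult and metadata are illustrative names, not a proposed Lightning API):

# sketch only: a dataclass with a required loss and a flexible metadata dict.
from dataclasses import dataclass, field
from typing import Any, Dict

import torch

@dataclass
class StepResult:
    loss: torch.Tensor                                       # required field
    metadata: Dict[str, Any] = field(default_factory=dict)   # flexible extras

result = StepResult(loss=torch.tensor(0.5), metadata={'acc': 0.9})
# StepResult(metadata={'acc': 0.9}) would raise a TypeError for the missing 'loss'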
