
Question about return value of `validation_epoch_end`


❓ Questions and Help

Before asking:

  1. search the issues.
  2. search the docs.

What is your question?

I’m a bit confused about what to return from methods like validation_epoch_end and what to put inside their log member.

Based on the docs, is the log member of the return value of validation_epoch_end mainly for logging and plotting?

In the MNIST example, if I change the validation_epoch_end method to


def validation_epoch_end(self, outputs):
    # OPTIONAL
    avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
    tensorboard_logs = {'val_loss': avg_loss}
    return {'avg_val_loss': avg_loss}

I get a RuntimeWarning: Can save best model only with val_loss available, skipping. It seems that it looks for metrics inside the log member to determine the best model.

If I change the training_step method to


def training_step(self, batch, batch_nb):
    # REQUIRED
    x, y = batch
    y_hat = self.forward(x)
    loss = F.cross_entropy(y_hat, y)
    tensorboard_logs = {'train_loss': loss}
    return {'log': tensorboard_logs}

and only put train_loss inside log, I get a RuntimeError: No loss value in the dictionary returned from model.training_step(). It seems that some procedure looks for the loss value in the return dict itself, not in its log member.

I’m confused about what to put in these methods’ return values and what belongs in their log member.


Updated:

Now I’ve run into this issue as well, and I’m getting more and more confused about why the test result ends up in the progress_bar member of the return value…
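For context, the pattern I mean looks roughly like this (an illustrative sketch of what I’ve seen, not something I found in the docs; the names here are just placeholders):

def test_epoch_end(self, outputs):
    avg_loss = torch.stack([x['test_loss'] for x in outputs]).mean()
    logs = {'test_loss': avg_loss}
    # the metrics placed under 'progress_bar' seem to be what gets reported
    # as the test result, which is the part that confuses me
    return {'avg_test_loss': avg_loss, 'log': logs, 'progress_bar': logs}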

Maybe I’m missing something, but I didn’t find the details of any of this in the docs.

Versions

pytorch-lightning: 0.7.1.

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
dscarmo commented, Mar 15, 2020

You need to return whatever metric the checkpoint callback is monitoring to select the best model. In this case that metric is val_loss, and you need to return it separately from the logs.

In the same vein, the backward pass is performed on the “loss” key of the dict returned from training_step, so you need to include a “loss” entry in that return value.
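For concreteness, here is a minimal sketch of both methods following that advice, using the same 0.7.x-style dict returns as the snippets in the question (torch and F imported as in the MNIST example; val_loss is returned at the top level and also placed under log, so it is both monitored and logged):

def training_step(self, batch, batch_nb):
    x, y = batch
    y_hat = self.forward(x)
    loss = F.cross_entropy(y_hat, y)
    # 'loss' is the value the backward pass uses; 'log' is only for logging
    return {'loss': loss, 'log': {'train_loss': loss}}

def validation_epoch_end(self, outputs):
    avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
    # 'val_loss' is the metric the checkpoint callback monitors;
    # putting it under 'log' as well sends it to TensorBoard
    return {'val_loss': avg_loss, 'log': {'val_loss': avg_loss}}

That should both silence the “Can save best model” warning and satisfy the loss lookup after training_step.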

0 reactions
stale[bot] commented, May 16, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

