
epoch_end Logs Default to Steps Instead of Epochs

See original GitHub issue

🐛 Bug

Logs generated within validation_epoch_end have their iteration set to the number of steps instead of the number of epochs.

To Reproduce

Steps to reproduce the behavior:

  1. Create a LightningModule with the functions below.
  2. Train the module for 1 or more epochs
  3. Run Tensorboard
  4. Note the iteration for logs generated in validation_epoch_end

Code Sample

    def training_step(self, train_batch, batch_idx):
        x, y = train_batch
        y_hat = self.forward(x)
        loss = self.cross_entropy_loss(y_hat, y)
        accuracy = self.accuracy(y_hat, y)
        # Per-batch metric: this is correctly plotted against the global step.
        logs = {'batch/train_loss': loss}
        return {'loss': loss, 'log': logs}

    def validation_step(self, val_batch, batch_idx):
        x, y = val_batch
        y_hat = self.forward(x)
        loss = self.cross_entropy_loss(y_hat, y)
        accuracy = self.accuracy(y_hat, y)
        return {'batch/val_loss': loss, 'batch/accuracy': accuracy}

    def validation_epoch_end(self, outputs):
        avg_loss = torch.stack([x['batch/val_loss'] for x in outputs]).mean()
        avg_accuracy = torch.stack([x['batch/accuracy'] for x in outputs]).mean()
        # Per-epoch metrics: expected to be plotted against the epoch,
        # but they show up against the global step instead (the reported bug).
        tensorboard_logs = {'epoch/val_loss': avg_loss, 'epoch/accuracy': avg_accuracy}
        return {'avg_val_loss': avg_loss, 'accuracy': avg_accuracy, 'log': tensorboard_logs}

Version

v0.7.1

Screenshot

(screenshot from the original issue omitted)

Expected behavior

Logs generated in *_epoch_end functions should use the epoch as their iteration (the x-axis in TensorBoard).


The documentation seems unclear on the matter. The Loggers documentation is empty, the LightningModule class doesn’t describe logging in detail, and the Introduction Guide’s example self.logger.summary.scalar('loss', loss) fails with AttributeError: 'TensorBoardLogger' object has no attribute 'summary'.
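
For reference, the TensorBoard logger does expose the underlying SummaryWriter as self.logger.experiment, so a direct scalar write can be done with add_scalar(). A minimal sketch, reusing the module’s own cross_entropy_loss helper from the code sample above (the metric name and step choice are illustrative, not taken from the guide):

    def training_step(self, train_batch, batch_idx):
        x, y = train_batch
        loss = self.cross_entropy_loss(self.forward(x), y)
        # TensorBoardLogger wraps torch.utils.tensorboard.SummaryWriter and exposes it
        # as `experiment`; add_scalar() exists there, summary.scalar() does not.
        self.logger.experiment.add_scalar('loss', loss, self.global_step)
        return {'loss': loss}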

Is there a workaround for this issue?

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 5

Top GitHub Comments

4 reactions
0x6b756d6172 commented, Mar 15, 2020

Looks like the LightningModule documentation for validation_epoch_end has a note regarding this: use the 'step' key, as in 'log': {'val_acc': val_acc_mean.item(), 'step': self.current_epoch}. This does work correctly as a workaround - perhaps it should be the default for epoch end?
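
A minimal sketch of that workaround, adapted to the reproduction code above (key names follow the issue’s snippet, and this assumes the dict-returning API of v0.7.x):

    def validation_epoch_end(self, outputs):
        avg_loss = torch.stack([x['batch/val_loss'] for x in outputs]).mean()
        # The 'step' entry tells the logger which x-axis value to use, so the
        # epoch-level metric is plotted per epoch rather than per global step.
        tensorboard_logs = {'epoch/val_loss': avg_loss, 'step': self.current_epoch}
        return {'avg_val_loss': avg_loss, 'log': tensorboard_logs}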

1 reaction
Rilwan-A commented, Nov 27, 2020

Looks like the LightningModule documentation for validation_epoch_end has a note regarding this: use the 'step' key, as in 'log': {'val_acc': val_acc_mean.item(), 'step': self.current_epoch}. This does work correctly as a workaround - perhaps it should be the default for epoch end?

In the current version of PyTorch Lightning we are encouraged to use self.log() to log statistics, as opposed to returning a dictionary as in the previous comment. I can’t see an obvious way to control the step with the new self.log() method - is there one?
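
One possible workaround, assuming the TensorBoard logger is in use, is to bypass self.log() for the epoch-level metrics and write to the underlying SummaryWriter directly, which lets you choose the step yourself. A sketch under that assumption (metric and key names are illustrative):

    def validation_epoch_end(self, outputs):
        avg_loss = torch.stack([x['batch/val_loss'] for x in outputs]).mean()
        # self.logger.experiment is the raw SummaryWriter when TensorBoardLogger is used,
        # so the third argument pins the x-axis value to the current epoch.
        self.logger.experiment.add_scalar('epoch/val_loss', avg_loss, self.current_epoch)

This side-steps the aggregation that self.log() provides, so it is a stop-gap rather than a replacement for it.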

Read more comments on GitHub >

Top Results From Across the Web

epoch_end Logs Default to Steps Instead of Epochs #1151
Bug Logs generated within validation_epoch_end have their iteration set to the number of steps instead of number of epochs.
Read more >
LightningModule - PyTorch Lightning - Read the Docs
The log() object automatically reduces the requested metrics across a complete epoch and devices. Here's the pseudocode of what it does under the...
Read more >
Is there a way to only log on epoch end using the new ...
I was able to do this by forcing step to be self.current_epoch (essentially ignoring global_step ) with the old dict interface, but TrainResult ......
Read more >
What is the difference between steps and epochs in ...
A training step is one gradient update. In one step batch_size examples are processed. An epoch consists of one full cycle through the...
Read more >
Callbacks - Keras 2.1.3 Documentation
The logs dictionary that callback methods take as argument will contain keys for quantities ... Events are sent to root + '/publish/epoch/end/' by...
Read more >
