on_train_end seems to get called before logging of last epoch has finished

🐛 Bug

Maybe not a bug, but unexpected behavior. When using the on_train_end method either to upload a model's latest .csv file created by TestTube to Neptune or to print the last numeric channel value of a metric sent to Neptune, the values from the final epoch have not yet been logged. When training has finished, the last line of metrics.csv is 2020-04-02 17:23:16.029189,0.04208208369463682,30.0, but for the outputs/uploads of on_train_end, see the code below:

Code sample

def on_epoch_end(self):
    # Logging loss per epoch
    train_loss_mean = np.mean(self.training_losses)
    # Saves loss of final epoch for later visualization
    self.final_loss = train_loss_mean
    self.logger[0].experiment.log_metric('epoch/mean_absolute_loss', y=train_loss_mean, x=self.current_epoch)
    self.logger[1].experiment.log({'epoch/mean_absolute_loss': train_loss_mean, 'epoch': self.current_epoch}, global_step=self.current_epoch)
    self.training_losses = []  # reset for next epoch

def on_train_end(self):
    save_dir = Path(self.logger[1].experiment.get_logdir()).parent/'metrics.csv'
    self.logger[0].experiment.log_artifact(save_dir)

Last line of uploaded metrics.csv: 2020-04-02 15:27:57.044250 0.04208208404108882 29.0

def on_train_end(self):
    log_last = self.logger[0].experiment.get_logs()
    print('Last logged values: ', log_last)

Output: Last logged values: {'epoch/mean_absolute_loss': Channel(channelType='numeric', id='b00cd0e5-a427-4a3c-a10c-5033808a930e', lastX=29.0, name='epoch/mean_absolute_loss', x=29.0, y='0.04208208404108882')}

When printing self.final_loss in on_train_end I get the correct last value though.

Expected behavior

The on_train_end method should only be called after the last values have been logged.

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
Dunrar commented, Apr 19, 2020

@HenryJia Just tried it, thank you! I’ll close this then 😃

1 reaction
HenryJia commented, Apr 17, 2020

@Dunrar Had a little look at this and your code. on_train_end is not being called before the epoch has finished; it just looks that way. What’s actually happening is that the logs aren’t finalised/saved until after on_train_end has been called, so the last epoch appears to be missing when you inspect the logs inside on_train_end.

https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/trainer/training_loop.py#L693

Adding self.logger[1].save() to the beginning of on_train_end() (or the end of on_epoch_end()) yields the result you’d expect for me with the test_tube logger. I’m not familiar with Neptune, but based on the structure of pytorch-lightning the result should be the same if you add self.logger[0].save() as well.
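
To make the ordering concrete, here is a minimal sketch in plain Python. It does not use pytorch-lightning, test_tube, or Neptune: BufferedCsvLogger and train are hypothetical stand-ins, and the loop order (earlier rows flushed during training, final flush only after the hook) is a simplifying assumption based on the explanation above, not the library's actual code. It only shows why a hook that reads the saved file sees epoch 29 unless it calls save() first.

```python
import csv
import io


class BufferedCsvLogger:
    """Hypothetical logger that keeps rows in memory until save() is called."""

    def __init__(self):
        self._pending = []          # rows logged but not yet written out
        self._file = io.StringIO()  # stands in for metrics.csv on disk

    def log(self, epoch, loss):
        self._pending.append((epoch, loss))

    def save(self):
        # Write all pending rows to the "file" and empty the buffer.
        csv.writer(self._file).writerows(self._pending)
        self._pending.clear()

    def last_saved_epoch(self):
        # Read back only what has actually been written.
        rows = list(csv.reader(io.StringIO(self._file.getvalue())))
        return int(rows[-1][0]) if rows else None


def train(num_epochs, save_in_hook):
    """Toy loop: returns (epoch seen inside the hook, epoch in the final file)."""
    logger = BufferedCsvLogger()

    def on_train_end():
        if save_in_hook:
            logger.save()  # the suggested fix: flush before reading
        return logger.last_saved_epoch()

    for epoch in range(num_epochs):
        logger.save()  # assumption: only rows from earlier epochs get flushed here
        logger.log(epoch, 1.0 / (epoch + 1))

    seen_in_hook = on_train_end()
    logger.save()  # assumption: the final flush happens after the hook
    return seen_in_hook, logger.last_saved_epoch()


print(train(31, save_in_hook=False))
print(train(31, save_in_hook=True))
```

With 31 epochs (0 to 30), the hook without save() reads 29 while the finished file ends at 30, which mirrors the 29.0 versus 30.0 mismatch in the report. Calling save() at the start of the hook makes both agree.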
