on_train_end seems to get called before logging of last epoch has finished

🐛 Bug

Maybe not a bug, but unexpected behavior. When I use the on_train_end method either to upload a model's latest .csv file created by TestTube to Neptune, or to print the last numeric channel value of a metric sent to Neptune, the values from the final epoch have not yet been logged. When training has finished, the last line of metrics.csv is 2020-04-02 17:23:16.029189,0.04208208369463682,30.0, but the outputs and uploads of on_train_end show older values; see the code below:

Code sample

from pathlib import Path

import numpy as np

def on_epoch_end(self):
    # Logging loss per epoch
    train_loss_mean = np.mean(self.training_losses)
    # Saves loss of final epoch for later visualization
    self.final_loss = train_loss_mean
    # logger[0] is the Neptune logger, logger[1] is the TestTube logger
    self.logger[0].experiment.log_metric('epoch/mean_absolute_loss', y=train_loss_mean, x=self.current_epoch)
    self.logger[1].experiment.log({'epoch/mean_absolute_loss': train_loss_mean, 'epoch': self.current_epoch}, global_step=self.current_epoch)
    self.training_losses = []  # reset for next epoch

def on_train_end(self):
    save_dir = Path(self.logger[1].experiment.get_logdir()).parent/'metrics.csv'
    self.logger[0].experiment.log_artifact(save_dir)

Last line of uploaded metrics.csv: 2020-04-02 15:27:57.044250 0.04208208404108882 29.0

def on_train_end(self):
    log_last = self.logger[0].experiment.get_logs()
    print('Last logged values: ', log_last)

Output: Last logged values: {'epoch/mean_absolute_loss': Channel(channelType='numeric', id='b00cd0e5-a427-4a3c-a10c-5033808a930e', lastX=29.0, name='epoch/mean_absolute_loss', x=29.0, y='0.04208208404108882')}

When printing self.final_loss in on_train_end I get the correct last value though.

Expected behavior

The on_train_end method should only be called after the last values have been logged.

Top GitHub Comments

Dunrar commented, Apr 19, 2020

@HenryJia Just tried it, thank you! I’ll close this then 😃

HenryJia commented, Apr 17, 2020

@Dunrar Had a little look at this and your code. on_train_end is not being called before the epoch has finished. It just looks that way. What’s actually happening is that the logs aren’t finalised/saved until after on_train_end has been called, so the last epoch appears to be missing when you inspect the logs inside on_train_end.

https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/trainer/training_loop.py#L693

Adding self.logger[1].save() to the beginning of on_train_end() (or to the end of on_epoch_end()) yields the result you’d expect for me with the test_tube logger. I’m not familiar with Neptune, but based on the structure of pytorch-lightning, the result should be the same if you add self.logger[0].save() as well.
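
To illustrate why an explicit save() helps, here is a minimal sketch in plain Python. It does not use pytorch-lightning, test_tube, or Neptune: BufferedCsvLogger, last_epoch_on_disk, and run_training are hypothetical stand-ins, and the flush timing is an assumption chosen only to mimic the symptom described above (metrics held in memory until a later save). It shows that reading the file at the end of training returns the previous epoch unless the buffer is flushed first.

```python
import csv
import os
import tempfile


class BufferedCsvLogger:
    """Hypothetical stand-in for a logger that keeps metrics in memory
    and only writes them to disk when save() is called."""

    def __init__(self, path):
        self.path = path
        self._rows = []  # metrics logged but not yet written to disk

    def log(self, epoch, loss):
        self._rows.append((epoch, loss))

    def save(self):
        # Append all buffered rows to the CSV file, then clear the buffer.
        with open(self.path, "a", newline="") as f:
            csv.writer(f).writerows(self._rows)
        self._rows.clear()


def last_epoch_on_disk(path):
    """Return the epoch stored in the last CSV row, or None if there is none."""
    if not os.path.exists(path):
        return None
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    return int(rows[-1][0]) if rows else None


def run_training(save_first):
    """Simulate 3 epochs. The loop flushes at the start of each epoch (an
    assumption for this sketch), so the final epoch is still buffered when
    the end-of-training step runs."""
    with tempfile.TemporaryDirectory() as tmp:
        logger = BufferedCsvLogger(os.path.join(tmp, "metrics.csv"))
        for epoch in range(3):
            logger.save()  # writes the previous epoch's rows, if any
            logger.log(epoch, 1.0 / (epoch + 1))
        # Equivalent of on_train_end: optionally flush before reading.
        if save_first:
            logger.save()
        return last_epoch_on_disk(logger.path)


stale = run_training(save_first=False)  # file read without flushing
fresh = run_training(save_first=True)   # file read after an explicit save()
print("without save():", stale, "| with save():", fresh)
```

In the module from the issue, the corresponding change would be the one described in the comment above: call self.logger[1].save() as the first statement of on_train_end, before reading or uploading metrics.csv.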
