Invoke training_epoch_end before validation_epoch_end
Looking at https://github.com/PyTorchLightning/pytorch-lightning/issues/2816#issuecomment-669336994, the current event order is:
on_validation_epoch_start
on_validation_epoch_end
on_train_start
on_epoch_start
on_train_epoch_start
on_validation_start
on_validation_epoch_start
on_validation_epoch_end
on_validation_end
on_epoch_end
on_train_epoch_end
on_epoch_start
on_train_epoch_start
on_validation_start
on_validation_epoch_start
on_validation_epoch_end
on_validation_end
on_epoch_end
on_train_epoch_end
on_train_end
I am expecting the following order instead:
on_validation_epoch_start
on_validation_epoch_end
on_epoch_start
* on_train_start
on_train_epoch_start
* on_train_epoch_end
* on_train_end
on_validation_start
on_validation_epoch_start
on_validation_epoch_end
on_validation_end
on_epoch_end
on_epoch_start
on_train_epoch_start
on_validation_start
on_validation_epoch_start
on_validation_epoch_end
on_validation_end
on_epoch_end
on_train_epoch_end
I moved three events: on_train_start, on_train_epoch_end, and on_train_end (marked with * above).
Justification: The training phase always completes before the validation phase begins. The old callback order does not reflect reality.
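For anyone who wants to reproduce the callback ordering above, here is a minimal sketch of a hook-logging callback (the hook names follow the 1.x callback API quoted in this issue; exact hook signatures varied slightly across 1.x releases, hence the `*args` on `on_train_epoch_end`):

```python
import pytorch_lightning as pl


class HookOrderLogger(pl.Callback):
    """Prints each callback hook as it fires so the ordering can be inspected."""

    def on_train_start(self, trainer, pl_module):
        print("on_train_start")

    def on_epoch_start(self, trainer, pl_module):
        print("on_epoch_start")

    def on_train_epoch_start(self, trainer, pl_module):
        print("on_train_epoch_start")

    def on_train_epoch_end(self, trainer, pl_module, *args):
        # *args absorbs the extra `outputs` argument that some 1.x releases pass here
        print("on_train_epoch_end")

    def on_validation_start(self, trainer, pl_module):
        print("on_validation_start")

    def on_validation_epoch_start(self, trainer, pl_module):
        print("on_validation_epoch_start")

    def on_validation_epoch_end(self, trainer, pl_module):
        print("on_validation_epoch_end")

    def on_validation_end(self, trainer, pl_module):
        print("on_validation_end")

    def on_epoch_end(self, trainer, pl_module):
        print("on_epoch_end")

    def on_train_end(self, trainer, pl_module):
        print("on_train_end")


# Usage (with any LightningModule and dataloaders):
# trainer = pl.Trainer(max_epochs=2, callbacks=[HookOrderLogger()])
# trainer.fit(model)
```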
In terms of LightningModule’s events, I am seeing the following invocation order:
training_step
training_step_end
validation_step
validation_step_end
validation_epoch_end
training_epoch_end
instead of:
training_step
training_step_end
* training_epoch_end
validation_step
validation_step_end
validation_epoch_end
Justification: The actual invocation order contradicts the order specified in the method documentation.
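The LightningModule ordering can be checked the same way with a throwaway module whose hooks only print their own names (a sketch; the data pipeline and Trainer setup are assumed):

```python
import torch
import pytorch_lightning as pl


class HookOrderModule(pl.LightningModule):
    """Throwaway module used only to observe the hook invocation order."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        print("training_step")
        return self.layer(batch).sum()

    def training_step_end(self, step_output):
        print("training_step_end")
        return step_output

    def training_epoch_end(self, outputs):
        print("training_epoch_end")

    def validation_step(self, batch, batch_idx):
        print("validation_step")
        return self.layer(batch).sum()

    def validation_step_end(self, step_output):
        print("validation_step_end")
        return step_output

    def validation_epoch_end(self, outputs):
        print("validation_epoch_end")

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)
```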
I’m trying to set a metric in validation_epoch_end that depends on the training loss, but I cannot do this unless training_epoch_end completes before validation_epoch_end.
Additionally, I think that anyone who reads train_loss from inside validation_epoch_end will actually get the value of the metric from the previous epoch. Meaning, if I read train_loss during the validation of epoch 10, I will actually see train_loss from epoch 9. Anyone making use of this value therefore risks corrupting their training.
Dear @cowwoc,
Thanks for the clarification.
If you use a LossTensor as described below, you have full control over when the reduction is performed, so you can access train_loss_epoch in validation_epoch_end.
Yes, you can, but you need to rely a bit more on the Trainer internals. The Trainer keeps track of is_last_batch, or you can use on_train_start or on_validation_start.
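The LossTensor code referenced above is not included in this excerpt, but one way to read the suggestion of controlling the reduction yourself is sketched below (an interpretation, not the exact workaround): accumulate the per-step training losses on the module and reduce them in on_validation_epoch_start, which runs after the epoch’s training batches but before validation_epoch_end.

```python
import torch
import pytorch_lightning as pl


class ManualReductionModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)
        self._train_losses = []        # per-step losses for the current epoch
        self.train_loss_epoch = None   # reduced value, fresh for this epoch

    def training_step(self, batch, batch_idx):
        loss = self.layer(batch).sum()
        self._train_losses.append(loss.detach())
        return loss

    def on_train_epoch_start(self):
        # Start a fresh accumulator each epoch.
        self._train_losses = []

    def on_validation_epoch_start(self):
        # All training batches of the current epoch have already run by the
        # time validation starts, so this reduction uses this epoch's data
        # even though training_epoch_end has not fired yet.
        if self._train_losses:
            self.train_loss_epoch = torch.stack(self._train_losses).mean()

    def validation_step(self, batch, batch_idx):
        return self.layer(batch).sum()

    def validation_epoch_end(self, outputs):
        val_loss = torch.stack(outputs).mean()
        if self.train_loss_epoch is not None:
            # Hypothetical metric name, purely for illustration.
            self.log("generalization_gap", val_loss - self.train_loss_epoch)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)
```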
@tchaton Is this workaround no longer needed due to “Re-define the current_epoch boundary” changes in https://github.com/PyTorchLightning/pytorch-lightning/releases/tag/1.6.0?
Does training_epoch_end now get invoked before validation_epoch_end?