Logger connector callback metrics only contain the last step
🐛 Bug
When I do

```python
def training_step(...):
    self.log("train_loss", ...)
```
and, seeing that `log()` uses mean reduction by default, I expect the `"train_loss"` handed back to me in the logger connector's callback metrics to be the average training loss across all examples. But right now it is actually the training loss of the last batch, because Lightning considers this an `on_step` metric. I argue that this is very unintuitive (and also undocumented). It is especially problematic when this metric is used to, e.g., perform epoch selection, since the performance on a single batch can have large variance. Worse, each epoch can have a different batch order, and hence a different last batch, so the loss values are not even comparable across epochs.

If this is a design decision, could you at least let me know how I could achieve my intended purpose?
cc @carmocca @edward-io @ananthsub @rohitgr7 @kamil-kaczmarek @Raalsky @Blaizzy
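For reference, a minimal sketch of one way to get the epoch-mean behavior described above, using Lightning's `on_step`/`on_epoch` flags to `self.log`; the loss computation is a placeholder, and this assumes the method lives inside a `LightningModule`:

```python
def training_step(self, batch, batch_idx):
    loss = self.compute_loss(batch)  # placeholder for the real loss

    # In training_step, self.log defaults to on_step=True, on_epoch=False,
    # which is why callback_metrics holds only the last batch's value.
    # on_epoch=True instead accumulates the value (mean-reduced by default)
    # across all batches of the epoch.
    self.log("train_loss", loss, on_step=False, on_epoch=True)
    return loss
```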
Got it, thanks a lot!
You can skip step (3) and print both `validation_metric` and `train_loss` in `on_train_epoch_end`. This will also work when validation runs multiple times per epoch.
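To make that concrete, here is a minimal sketch of the suggested pattern, assuming the metric names used above (`validation_metric` stands in for whatever the validation step actually logs) and a toy model; it relies on validation completing before `on_train_epoch_end` fires, which is what the comment above describes:

```python
import torch
import torch.nn.functional as F
import pytorch_lightning as pl

class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)  # toy model

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = F.mse_loss(self.layer(x), y)
        # Epoch-mean rather than last-batch value in callback_metrics.
        self.log("train_loss", loss, on_step=False, on_epoch=True)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        # In validation_step, self.log already defaults to on_epoch=True.
        self.log("validation_metric", F.mse_loss(self.layer(x), y))

    def on_train_epoch_end(self):
        # By this point both epoch-level values have been aggregated
        # into trainer.callback_metrics.
        metrics = self.trainer.callback_metrics
        print(metrics.get("train_loss"), metrics.get("validation_metric"))

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)
```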