Logger connector callback metrics only contain the last step
🐛 Bug
When I do

```python
def training_step(...):
    self.log("train_loss", ...)
```
and, seeing that `log()` uses mean reduction by default, I expect the `"train_loss"` handed back to me in the logger connector's callback metrics to be the average training loss across all examples. But right now it is actually the training loss of the last batch, because Lightning considers this an `on_step` metric. I argue that this is very unintuitive (and also undocumented). It is especially problematic when this metric is used to, e.g., perform epoch selection, since the performance on a single batch can have large variance. Worse, each epoch can have a different batch order, and hence a different last batch, so the loss values are not even comparable across epochs.

If this is a design decision, could you at least let me know how I could achieve my intended purpose?
cc @carmocca @edward-io @ananthsub @rohitgr7 @kamil-kaczmarek @Raalsky @Blaizzy
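For reference, a minimal sketch of one way to get the epoch-mean behavior described above, using Lightning's `on_step`/`on_epoch` flags to `self.log`; the loss computation is a placeholder, and this assumes the method lives inside a `LightningModule`:

```python
def training_step(self, batch, batch_idx):
    loss = self.compute_loss(batch)  # placeholder for the real loss

    # In training_step, self.log defaults to on_step=True, on_epoch=False,
    # which is why callback_metrics holds only the last batch's value.
    # on_epoch=True instead accumulates the value (mean-reduced by default)
    # across all batches of the epoch.
    self.log("train_loss", loss, on_step=False, on_epoch=True)
    return loss
```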
Got it, thanks a lot!
You can skip step (3) and print both `validation_metric` and `train_loss` in `on_train_epoch_end`. This will also work when validation runs multiple times per epoch.
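To make that concrete, here is a minimal sketch of the suggested pattern, assuming the metric names used above (`validation_metric` stands in for whatever the validation step actually logs) and a toy model; it relies on validation completing before `on_train_epoch_end` fires, which is what the comment above describes:

```python
import torch
import torch.nn.functional as F
import pytorch_lightning as pl

class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)  # toy model

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = F.mse_loss(self.layer(x), y)
        # Epoch-mean rather than last-batch value in callback_metrics.
        self.log("train_loss", loss, on_step=False, on_epoch=True)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        # In validation_step, self.log already defaults to on_epoch=True.
        self.log("validation_metric", F.mse_loss(self.layer(x), y))

    def on_train_epoch_end(self):
        # By this point both epoch-level values have been aggregated
        # into trainer.callback_metrics.
        metrics = self.trainer.callback_metrics
        print(metrics.get("train_loss"), metrics.get("validation_metric"))

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)
```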