Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Validation step metrics not logged

See original GitHub issue

❓ Questions and Help

It seems like the data returned from validation_step does not get logged to TensorBoard; it first needs to be aggregated in validation_epoch_end, which is not the case for training_step.

The code below only shows val_loss, and only in aggregated form, instead of all of mae, mape, etc. from every iteration.

As a workaround I could log explicitly, but how do I get the current iteration? In the callbacks I only see how to get the current epoch.

    def step(self, y_hat, y, mode='train'):
        loss = F.mse_loss(y_hat, y)
        mae = F.l1_loss(y_hat, y)
        mape = median_absolute_percentage_error(y_hat, y)
        r2 = r2_score(y_hat, y)
        out = {'loss': loss, 'mae': mae, 'mape': mape, 'R2': r2}
        if mode == 'train':
            out['log'] = out.copy()
            return out
        elif mode == 'val':
            # Prefix the keys so validation metrics don't collide with training ones.
            out = {f'{mode}_{k}': v for k, v in out.items()}
            out['log'] = out.copy()
            return out
        else:
            raise ValueError(f'Unsupported mode: {mode}')

    def validation_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        return self.step(y_hat, y, 'val')

    def validation_epoch_end(self, outputs):
        avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        tensorboard_logs = {'val_loss': avg_loss}
        return {'val_loss': avg_loss, 'log': tensorboard_logs}
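One way to get mae, mape, and R2 into TensorBoard (in aggregated form) is to average every returned key in validation_epoch_end rather than only val_loss. Here is a minimal sketch of that key-wise averaging, written with plain Python floats so it is self-contained; in the LightningModule the values are tensors and you would use torch.stack([...]).mean() instead of statistics.mean. The helper name aggregate_val_outputs is hypothetical.

```python
from statistics import mean

def aggregate_val_outputs(outputs):
    # Average every metric key across the per-batch dicts returned by
    # validation_step, skipping the nested 'log' dict.
    keys = [k for k in outputs[0] if k != 'log']
    logs = {k: mean(o[k] for o in outputs) for k in keys}
    # Return the structure validation_epoch_end is expected to produce:
    # the aggregated loss plus a 'log' dict for the logger.
    return {'val_loss': logs['val_loss'], 'log': logs}

# Two simulated validation batches:
outputs = [
    {'val_loss': 0.4, 'val_mae': 0.5, 'val_mape': 0.1, 'val_R2': 0.8},
    {'val_loss': 0.2, 'val_mae': 0.3, 'val_mape': 0.3, 'val_R2': 0.9},
]
result = aggregate_val_outputs(outputs)
print(result['log'])
```

This keeps the per-epoch granularity the loggers expect while still surfacing every metric, not just the loss.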

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

1 reaction
awaelchli commented, Jun 7, 2020

You can get the current step with self.global_step or self.trainer.global_step.

Yes, it’s true you can log it manually yourself, but there is a reason why we don’t log each validation step: the loggers (all of them, as far as I know) use a global step for logging. This means that if your training epoch has n batches and your validation has m batches, after the first epoch you will have logged n + m steps for training + validation, and your training loss will then continue at step n + m + 1 instead of n + 1 for epoch 2. You will see a big jump in the visualization.

TensorBoard is too limited: you cannot set the abscissa to anything other than the step (as far as I know). Therefore logging each validation step makes little sense.
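The jump described above can be illustrated with a toy step counter (a hypothetical helper, not part of Lightning): if every logged batch advances the shared global step, then with n training and m validation batches per epoch, epoch 2's training curve resumes at step n + m instead of n.

```python
def first_step_of_epoch(epoch, n_train, m_val, log_val_steps):
    # Toy model of a logger whose global step advances once per logged
    # batch; `epoch` is 0-indexed. If validation steps are logged too,
    # they also consume global steps.
    steps_per_epoch = n_train + (m_val if log_val_steps else 0)
    return epoch * steps_per_epoch

# With n = 100 training batches and m = 20 validation batches:
print(first_step_of_epoch(1, 100, 20, log_val_steps=False))  # training-only logging
print(first_step_of_epoch(1, 100, 20, log_val_steps=True))   # per-step validation logging
```

Under per-step validation logging the training curve for epoch 2 starts 20 steps later than it would otherwise, which is the visual jump the comment warns about.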

0 reactions
stale[bot] commented, Aug 7, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.


Top Results From Across the Web

Validation step metrics not logged · Issue #2102 · Lightning-AI ...
It seems like data output in the validation_step does not get logged to tensorboard, it needs to be aggregated first in the validation_epoch_end ...

How to monitor both train and validation metrics at the same ...
Hi, I am finetuning BertForSequenceClassification using run_glue.py and I would like to output every logging_steps all the performance ...

TorchMetrics in PyTorch Lightning
Logging metrics can be done in two ways: either logging the metric object directly or the computed metric values. When Metric objects, which ...

Metrics & Performance - Documentation - Weights & Biases
Can I log metrics on two different time scales? (For example, I want to log training accuracy per batch and validation accuracy per ...

TensorBoard Scalars: Logging training metrics in Keras
TensorBoard Scalars: Logging training metrics in Keras ... note how both training and validation loss rapidly decrease ... Not bad!
