
ReduceLROnPlateau does not recognise val_loss despite progress_bar dict

See original GitHub issue

🐛 Bug

When training my model, I get the following message:

  File "C:\Users\Luc\Miniconda3\envs\pytorch\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 371, in train
    raise MisconfigurationException(m)
pytorch_lightning.utilities.debugging.MisconfigurationException: ReduceLROnPlateau conditioned on metric val_loss which is not available. Available metrics are: loss

This is similar to #321, for instance, but I definitely return a progress_bar dict with a val_loss key in it (see code below).

Code sample

def training_step(self, batch, batch_idx):
    z, y_true = batch
    y_pred = self.forward(z)
    loss_val = self.loss_function(y_pred, y_true)
    return {'loss': loss_val.sqrt()}

def validation_step(self, batch, batch_idx):
    z, y_true = batch
    # Track the current learning rate so it can be logged alongside val_loss
    lr = torch.tensor(self.optim.param_groups[0]['lr'])
    y_pred = self.forward(z)
    loss_val = self.loss_function(y_pred, y_true)
    return {'val_loss': loss_val.sqrt(), 'lr': lr}

def validation_epoch_end(self, outputs):
    # Average val_loss over the epoch and expose it via progress_bar and log
    val_loss_mean = torch.stack([x['val_loss'] for x in outputs]).mean()
    lr = outputs[-1]['lr']
    logs = {'val_loss': val_loss_mean, 'lr': lr}
    return {'val_loss': val_loss_mean, 'progress_bar': logs, 'log': logs}
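
The report does not include configure_optimizers, where the scheduler is attached. A minimal sketch of what that setup presumably looks like (the optimizer choice, its hyperparameters, and the self.optim assignment are assumptions, not taken from the issue):

import torch
from torch.optim.lr_scheduler import ReduceLROnPlateau

def configure_optimizers(self):
    # Hypothetical reconstruction: the original report omits this method.
    optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
    self.optim = optimizer  # validation_step reads the current LR from here
    scheduler = ReduceLROnPlateau(optimizer, mode='min')
    # Returned this way, Lightning conditions ReduceLROnPlateau on 'val_loss'
    # by default, which is the metric named in the error message.
    return [optimizer], [scheduler]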

Expected behavior

The val_loss value returned in the progress_bar dict should be picked up, so that ReduceLROnPlateau can condition on it.

Environment

  • PyTorch Version (e.g., 1.0): 1.4.0
  • OS (e.g., Linux): Windows 10
  • How you installed PyTorch (conda, pip, source): pip
  • Python version: 3.6.10
  • CUDA/cuDNN version: 10
  • GPU models and configuration: 1070Ti x 1
  • Any other relevant information:

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 15 (12 by maintainers)

Top GitHub Comments

1 reaction
SkafteNicki commented, Mar 30, 2020

I do not think it is possible just out of the box. However, if you configure your scheduler correctly, then it should be possible. For example, if I initialize my Trainer as trainer = Trainer(val_check_interval=50) and initialize my scheduler as

scheduler = {
    'scheduler': ReduceLROnPlateau(optimizer, mode, factor, patience),
    'interval': 'step',
    'frequency': 100
}

it should work (not tested), since val_loss will be created every 50 steps but the scheduler will first be called after 100 steps.
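
Spelled out, that combination might look like the following sketch (untested, as the comment notes; the optimizer and its hyperparameters are placeholders):

import torch
from torch.optim.lr_scheduler import ReduceLROnPlateau
from pytorch_lightning import Trainer

# Inside the LightningModule:
def configure_optimizers(self):
    optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
    scheduler = {
        'scheduler': ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=10),
        'interval': 'step',  # step the scheduler on a per-step basis...
        'frequency': 100,    # ...but only every 100 training steps
    }
    return [optimizer], [scheduler]

# Validation runs every 50 training steps, so val_loss already exists by the
# time the scheduler first calls .step() at step 100.
trainer = Trainer(val_check_interval=50)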

1 reaction
SkafteNicki commented, Mar 25, 2020

Okay, after looking at your code @alexeykarnachev, this does not seem to be a bug. When you set 'interval': 'step', you are calling the .step() method of ReduceLROnPlateau after each batch, so it makes complete sense that no val_loss has been calculated yet. If you really want to do something like this, you need to set val_check_interval in the Trainer constructor to a number lower than frequency in the scheduler configuration. That way, val_loss will be calculated before .step() is called.

