
ReduceLROnPlateau does not recognise val_loss despite progress_bar dict

See original GitHub issue

🐛 Bug

When training my model, I get the following message:

  File "C:\Users\Luc\Miniconda3\envs\pytorch\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 371, in train
    raise MisconfigurationException(m)
pytorch_lightning.utilities.debugging.MisconfigurationException: ReduceLROnPlateau conditioned on metric val_loss which is not available. Available metrics are: loss

This is similar to #321, for instance, but I definitely return a progress_bar dict with a val_loss key in it (see code below).

Code sample

def training_step(self, batch, batch_idx):
    z, y_true = batch
    y_pred = self.forward(z)
    loss_val = self.loss_function(y_pred, y_true)
    return {'loss': loss_val.sqrt()}

def validation_step(self, batch, batch_idx):
    z, y_true = batch
    # Track the current learning rate so it can be logged alongside val_loss
    lr = torch.tensor(self.optim.param_groups[0]['lr'])
    y_pred = self.forward(z)
    loss_val = self.loss_function(y_pred, y_true)
    return {'val_loss': loss_val.sqrt(), 'lr': lr}

def validation_epoch_end(self, outputs):
    # Average val_loss over the epoch and expose it via progress_bar and log
    val_loss_mean = torch.stack([x['val_loss'] for x in outputs]).mean()
    lr = outputs[-1]['lr']
    logs = {'val_loss': val_loss_mean, 'lr': lr}
    return {'val_loss': val_loss_mean, 'progress_bar': logs, 'log': logs}
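
The report does not include configure_optimizers, where the scheduler is attached. A minimal sketch of what that setup presumably looks like (the optimizer choice, its hyperparameters, and the self.optim assignment are assumptions, not taken from the issue):

import torch
from torch.optim.lr_scheduler import ReduceLROnPlateau

def configure_optimizers(self):
    # Hypothetical reconstruction: the original report omits this method.
    optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
    self.optim = optimizer  # validation_step reads the current LR from here
    scheduler = ReduceLROnPlateau(optimizer, mode='min')
    # Returned this way, Lightning conditions ReduceLROnPlateau on 'val_loss'
    # by default, which is the metric named in the error message.
    return [optimizer], [scheduler]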

Expected behavior

The val_loss value returned in the progress_bar dict should be picked up, so that ReduceLROnPlateau can condition on it.

Environment

  • PyTorch Version (e.g., 1.0): 1.4.0
  • OS (e.g., Linux): Windows 10
  • How you installed PyTorch (conda, pip, source): pip
  • Python version: 3.6.10
  • CUDA/cuDNN version: 10
  • GPU models and configuration: 1070Ti x 1
  • Any other relevant information:

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 15 (12 by maintainers)

Top GitHub Comments

1 reaction
SkafteNicki commented, Mar 30, 2020

I do not think it is possible just out of the box. However, if you configure your scheduler correctly, then it should be possible. For example, if I initialize my Trainer as trainer = Trainer(val_check_interval=50) and initialize my scheduler as

scheduler = {
    'scheduler': ReduceLROnPlateau(optimizer, mode, factor, patience),
    'interval': 'step',
    'frequency': 100
}

it should work (not tested), since val_loss will be created every 50 steps but the scheduler will first be called after 100 steps.
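
Spelled out, that combination might look like the following sketch (untested, as the comment notes; the optimizer and its hyperparameters are placeholders):

import torch
from torch.optim.lr_scheduler import ReduceLROnPlateau
from pytorch_lightning import Trainer

# Inside the LightningModule:
def configure_optimizers(self):
    optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
    scheduler = {
        'scheduler': ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=10),
        'interval': 'step',  # step the scheduler on a per-step basis...
        'frequency': 100,    # ...but only every 100 training steps
    }
    return [optimizer], [scheduler]

# Validation runs every 50 training steps, so val_loss already exists by the
# time the scheduler first calls .step() at step 100.
trainer = Trainer(val_check_interval=50)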

1 reaction
SkafteNicki commented, Mar 25, 2020

Okay, after looking at your code @alexeykarnachev, this does not seem to be a bug. When you set 'interval': 'step', you are calling the .step() method of ReduceLROnPlateau after each batch, so it makes complete sense that no val_loss has been calculated yet. If you really want to do something like this, you need to set val_check_interval in the Trainer constructor to a number lower than frequency in the scheduler configuration. That way, val_loss will be calculated before .step() is called.

