
Multiple validations per epoch with PytorchLightning

See original GitHub issue

With large datasets it sometimes makes sense to evaluate on the validation set before the epoch ends, because the model may reach decent performance within just 1-2 epochs, so we want to monitor it early.

This is exactly my scenario. When performing more than one validation per epoch, I get this warning from Optuna:

.../site-packages/optuna/trial/_trial.py:590: UserWarning: The reported value is ignored because this `step` 1 is already reported

Expected behavior

Every value should be reported. When PyTorch Lightning is used (as in my case), a better indicator of training progress is not the epoch but the step (1 step = 1 forward pass).

More importantly than a value being silently dropped, it should be verified that this behavior does not interfere with the hyperparameter optimization logic.

Environment

  • Optuna version: 2.9.1
  • Python version: 3.8.5
  • OS: Linux-3.10.0-1127.18.2.el7.x86_64-x86_64-with-glibc2.10
  • (Optional) Other libraries and their versions:
    • pytorch-lightning: 1.4.5

Error messages, stack traces, or logs

(reporting the one written above)

.../site-packages/optuna/trial/_trial.py:590: UserWarning: The reported value is ignored because this `step` 1 is already reported

Steps to reproduce

Run optuna on a model where the validation phase is done multiple times per epoch. For instance train any model with PyTorch Lightning with the following trainer:

trainer = pl.Trainer(
    gpus=1,
    max_epochs=1,
    callbacks=[
        PyTorchLightningPruningCallback(trial, monitor="val_accuracy"),  # set your metric here
    ],
    num_sanity_val_steps=0,
    val_check_interval=1,  # validate after every training batch, so validation runs many times per epoch
)

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 9

Top GitHub Comments

1 reaction
robogast commented, Oct 18, 2021

I have the exact same situation: my epochs are relatively large (multiple hours each), so I perform multiple validation runs per epoch. I’m not familiar enough with Optuna trials to know the exact impact of using global_step, which may result in skipped steps when the callback is only invoked in on_validation_end; I think pruners also use the internal step count, so skipping steps might not be ideal?

(Personally) I think the definition of an epoch is not crystal clear or useful; e.g. when using PyTorch DDP + many machines, an ‘epoch’ processes the whole dataset multiple times per ‘epoch’ anyway…

Therefore I would like to suggest an implementation which:

  • uses global_step, if we can confirm this is side-effect free
  • or:
    • gets called on training_step_end and uses global_step when tracking train metrics
    • counts the number of performed validation runs when tracking validation metrics
0 reactions
github-actions[bot] commented, Feb 9, 2022

This issue was closed automatically because it had not seen any recent activity. If you want to discuss it, you can reopen it freely.
