Multiple validations per epoch with PytorchLightning
With large datasets it sometimes makes sense to check the validation set before the epoch ends, because the model can reach decent performance within just 1-2 epochs, so we want to monitor it early.
This is exactly my scenario. When performing more than one validation per epoch I get this warning message from optuna:
.../site-packages/optuna/trial/_trial.py:590: UserWarning: The reported value is ignored because this `step` 1 is already reported
Expected behavior
Every value should be reported. When PyTorch Lightning is used (as in my case), a better indicator of training progress is not the epoch but the step (1 step = 1 forward pass).
More importantly than a value simply not being reported, it should be verified that this behavior does not interfere with the hyperparameter-optimization logic (e.g. pruning decisions).
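The warning comes from how Optuna keeps intermediate values: a trial stores at most one value per step index, and a second report at an already-seen step is dropped with a warning. The following is a pure-Python mock of that behavior for illustration only; `MockTrial` is a toy stand-in, not Optuna's actual implementation:

```python
import warnings

class MockTrial:
    """Toy stand-in mimicking optuna.trial.Trial.report's step handling.
    Assumption for illustration: the first value at a step wins; repeats warn."""

    def __init__(self):
        self.intermediate_values = {}  # step index -> reported value

    def report(self, value, step):
        if step in self.intermediate_values:
            # Mirrors the warning text from the issue: the new value is dropped.
            warnings.warn(
                f"The reported value is ignored because this `step` {step} is already reported"
            )
            return
        self.intermediate_values[step] = value

trial = MockTrial()
trial.report(0.71, step=1)  # first mid-epoch validation, reported at epoch index 1
trial.report(0.78, step=1)  # second validation in the same epoch: warns, value is lost
```

Because both validation runs map to the same epoch index, the improved 0.78 never reaches the trial, which is exactly the situation described above.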
Environment
- Optuna version: 2.9.1
- Python version: 3.8.5
- OS: Linux-3.10.0-1127.18.2.el7.x86_64-x86_64-with-glibc2.10
- (Optional) Other libraries and their versions:
- pytorch-lightning: 1.4.5
Error messages, stack traces, or logs
(the same warning shown above)
.../site-packages/optuna/trial/_trial.py:590: UserWarning: The reported value is ignored because this `step` 1 is already reported
Steps to reproduce
Run Optuna on a model whose validation phase runs multiple times per epoch. For instance, train any model with PyTorch Lightning using the following trainer:
```python
trainer = pl.Trainer(
    gpus=1,
    max_epochs=1,
    callbacks=[
        PyTorchLightningPruningCallback(trial, monitor="val_accuracy"),  # set your metric here
    ],
    num_sanity_val_steps=0,
    val_check_interval=1,  # validate after every training batch, so validation (and reporting) happens many times per epoch
)
```
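One way to avoid the step collision is to key reports by the trainer's `global_step`, which is strictly increasing across mid-epoch validation runs. Below is a hedged sketch of that idea: `GlobalStepPruningCallback` is a hypothetical name, and to keep it self-contained it is written as a plain class; real code would subclass `pytorch_lightning.callbacks.Callback` and raise `optuna.TrialPruned` instead of a generic error:

```python
class GlobalStepPruningCallback:
    """Hypothetical sketch: report the monitored metric keyed by
    trainer.global_step rather than the epoch index, so several
    validation runs within one epoch get distinct report steps."""

    def __init__(self, trial, monitor):
        self._trial = trial
        self._monitor = monitor

    def on_validation_end(self, trainer, pl_module):
        value = trainer.callback_metrics.get(self._monitor)
        if value is None:
            return
        # global_step is unique per optimizer step, so mid-epoch
        # validations no longer collide on the same report step.
        self._trial.report(float(value), step=trainer.global_step)
        if self._trial.should_prune():
            raise RuntimeError("trial pruned")  # optuna.TrialPruned in real code

# Minimal stubs to exercise the sketch without Lightning/Optuna installed.
class _StubTrial:
    def __init__(self):
        self.reports = []
    def report(self, value, step):
        self.reports.append((step, value))
    def should_prune(self):
        return False

class _StubTrainer:
    def __init__(self, metrics, global_step):
        self.callback_metrics = metrics
        self.global_step = global_step

trial = _StubTrial()
cb = GlobalStepPruningCallback(trial, monitor="val_accuracy")
cb.on_validation_end(_StubTrainer({"val_accuracy": 0.7}, 50), None)
cb.on_validation_end(_StubTrainer({"val_accuracy": 0.8}, 100), None)
```

With `global_step` as the key, both reports land at distinct steps (50 and 100 here) and neither is discarded.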
Issue Analytics
- State: closed
- Created 2 years ago
- Comments: 9
I have the exact same situation: my epochs are relatively large (multiple hours per epoch), so I perform multiple validation runs per epoch. I'm not familiar enough with Optuna trials to know exactly what the impact of using `global_step` would be; it may result in skipped step indices when `report` is only called in `on_validation_end`. I think pruners also use the internal step count, so skipping steps might not be ideal? (Personally) I think the definition of an epoch is not crystal clear or particularly useful; e.g. when using PyTorch DDP across many machines, an 'epoch' processes the whole dataset multiple times anyway…
Therefore I would like to suggest an implementation which:
- uses `global_step`, if we can confirm this is side-effect free
- hooks into `training_step_end` when tracking train metrics, and uses `global_step` there as well
This issue was closed automatically because it had not seen any recent activity. If you want to discuss it, you can reopen it freely.