Multiple validations per epoch with PytorchLightning
With large datasets it sometimes makes sense to check the validation set before the epoch ends, because the model can reach decent performance within just 1-2 epochs, so we want to monitor it early.
This is exactly my scenario. When performing more than one validation per epoch I get this warning message from optuna:
.../site-packages/optuna/trial/_trial.py:590: UserWarning: The reported value is ignored because this `step` 1 is already reported
Expected behavior
Every value should be reported. When PyTorch Lightning is used (as in my case), a better indicator of training progress is not the epoch but the step (1 step = 1 forward pass).
More importantly than a value simply not being reported, it should be verified that this behavior does not interfere with the hyperparameter-optimization logic (e.g. pruning decisions).
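The warning comes from how Optuna keeps intermediate values: a trial stores at most one value per step index, and a second report at an already-seen step is dropped with a warning. The following is a pure-Python mock of that behavior for illustration only; `MockTrial` is a toy stand-in, not Optuna's actual implementation:

```python
import warnings

class MockTrial:
    """Toy stand-in mimicking optuna.trial.Trial.report's step handling.
    Assumption for illustration: the first value at a step wins; repeats warn."""

    def __init__(self):
        self.intermediate_values = {}  # step index -> reported value

    def report(self, value, step):
        if step in self.intermediate_values:
            # Mirrors the warning text from the issue: the new value is dropped.
            warnings.warn(
                f"The reported value is ignored because this `step` {step} is already reported"
            )
            return
        self.intermediate_values[step] = value

trial = MockTrial()
trial.report(0.71, step=1)  # first mid-epoch validation, reported at epoch index 1
trial.report(0.78, step=1)  # second validation in the same epoch: warns, value is lost
```

Because both validation runs map to the same epoch index, the improved 0.78 never reaches the trial, which is exactly the situation described above.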
Environment
- Optuna version: 2.9.1
- Python version: 3.8.5
- OS: Linux-3.10.0-1127.18.2.el7.x86_64-x86_64-with-glibc2.10
- (Optional) Other libraries and their versions:
- pytorch-lightning: 1.4.5
Error messages, stack traces, or logs
(the same warning shown above)
.../site-packages/optuna/trial/_trial.py:590: UserWarning: The reported value is ignored because this `step` 1 is already reported
Steps to reproduce
Run Optuna on a model whose validation phase runs multiple times per epoch. For instance, train any model with PyTorch Lightning using the following trainer:
```python
trainer = pl.Trainer(
    gpus=1,
    max_epochs=1,
    callbacks=[
        PyTorchLightningPruningCallback(trial, monitor="val_accuracy"),  # set your metric here
    ],
    num_sanity_val_steps=0,
    val_check_interval=1,  # validate after every training batch, so validation (and reporting) happens many times per epoch
)
```
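One way to avoid the step collision is to key reports by the trainer's `global_step`, which is strictly increasing across mid-epoch validation runs. Below is a hedged sketch of that idea: `GlobalStepPruningCallback` is a hypothetical name, and to keep it self-contained it is written as a plain class; real code would subclass `pytorch_lightning.callbacks.Callback` and raise `optuna.TrialPruned` instead of a generic error:

```python
class GlobalStepPruningCallback:
    """Hypothetical sketch: report the monitored metric keyed by
    trainer.global_step rather than the epoch index, so several
    validation runs within one epoch get distinct report steps."""

    def __init__(self, trial, monitor):
        self._trial = trial
        self._monitor = monitor

    def on_validation_end(self, trainer, pl_module):
        value = trainer.callback_metrics.get(self._monitor)
        if value is None:
            return
        # global_step is unique per optimizer step, so mid-epoch
        # validations no longer collide on the same report step.
        self._trial.report(float(value), step=trainer.global_step)
        if self._trial.should_prune():
            raise RuntimeError("trial pruned")  # optuna.TrialPruned in real code

# Minimal stubs to exercise the sketch without Lightning/Optuna installed.
class _StubTrial:
    def __init__(self):
        self.reports = []
    def report(self, value, step):
        self.reports.append((step, value))
    def should_prune(self):
        return False

class _StubTrainer:
    def __init__(self, metrics, global_step):
        self.callback_metrics = metrics
        self.global_step = global_step

trial = _StubTrial()
cb = GlobalStepPruningCallback(trial, monitor="val_accuracy")
cb.on_validation_end(_StubTrainer({"val_accuracy": 0.7}, 50), None)
cb.on_validation_end(_StubTrainer({"val_accuracy": 0.8}, 100), None)
```

With `global_step` as the key, both reports land at distinct steps (50 and 100 here) and neither is discarded.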
Issue Analytics
- State: closed
- Created 2 years ago
- Comments: 9
I have the exact same situation: my epochs are relatively large (multiple hours per epoch), so I perform multiple validation runs per epoch. I'm not familiar enough with Optuna trials to know exactly what the impact of using `global_step` would be; it may result in skipped step indices when `report` is only called in `on_validation_end`. I think pruners also use the internal step count, so skipping steps might not be ideal? (Personally) I think the definition of an epoch is not crystal clear or particularly useful; e.g. when using PyTorch DDP across many machines, an 'epoch' processes the whole dataset multiple times anyway…
Therefore I would like to suggest an implementation which:
- uses `global_step`, if we can confirm this is side-effect free
- hooks into `training_step_end` when tracking train metrics, and uses `global_step` there as well
This issue was closed automatically because it had not seen any recent activity. If you want to discuss it, you can reopen it freely.