
`val_check_interval` (int) is not `global_steps` when `accumulate_grad_batches` > 1


🐛 Bug

val_check_interval doc:

  • use (int) to check every n steps (batches)

But when accumulate_grad_batches > 1, val_check_interval refers to total forward steps (batches). This is weird, because the number of model updates is the number of backward steps (i.e. global_steps), which equals total forward steps / accumulate_grad_batches. By contrast, log_every_n_steps works correctly: it refers to total backward steps (i.e. global_steps).

To Reproduce

import pytorch_lightning as pl

trainer = pl.Trainer(
    # ...
    accumulate_grad_batches=8,   # one optimizer step per 8 forward batches
    val_check_interval=100,      # counts forward batches, not optimizer steps
    # ...
)
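To make the mismatch concrete, here is a small arithmetic sketch (plain Python counting, not Lightning code; the numbers come from the config above):

accumulate_grad_batches = 8
val_check_interval = 100

# What val_check_interval actually counts: forward batches.
# Validation therefore runs after 100 batches, which is only
# 100 / 8 = 12.5 optimizer steps (global_steps).
steps_per_validation = val_check_interval / accumulate_grad_batches  # 12.5

# What the reporter expected: validation every 100 optimizer steps,
# which corresponds to 100 * 8 = 800 forward batches.
batches_per_expected_validation = val_check_interval * accumulate_grad_batches  # 800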

Environment

  • CUDA:
    - GPU: 8× A100-SXM4-40GB
    - available: True
    - version: 11.3
  • Packages:
    - numpy: 1.20.1
    - pyTorch_debug: False
    - pyTorch_version: 1.10.2+cu113
    - pytorch-lightning: 1.5.10
    - tqdm: 4.63.0
  • System:
    - OS: Linux
    - architecture: 64bit
    - processor: x86_64
    - python: 3.7.10
    - version: #1 SMP Fri Mar 19 10:07:22 CST 2021

cc @borda @carmocca @justusschock @ananthsub @ninginthecloud @rohitgr7 @kaushikb11 @awaelchli

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

4 reactions
ShaneTian commented, Mar 6, 2022

Doesn’t this seem strange?

import pytorch_lightning as pl

val_ckpt_callback = pl.callbacks.ModelCheckpoint(
    filename="val_end-{epoch}-{step}-{val_loss:.2f}",
    save_top_k=-1,      # keep every checkpoint instead of only the best k
    every_n_epochs=1,
)
trainer = pl.Trainer(
    # ...
    callbacks=[val_ckpt_callback],
    accumulate_grad_batches=8,
    val_check_interval=100,
    # ...
)

The final checkpoints are (note that {step} in the filename reflects optimizer steps, not the forward batches that val_check_interval counts):

val_end-epoch=0-step=24-val_loss=3.11.ckpt
val_end-epoch=0-step=49-val_loss=2.98.ckpt
val_end-epoch=0-step=74-val_loss=2.45.ckpt
val_end-epoch=0-step=99-val_loss=2.02.ckpt

3 reactions
carmocca commented, Mar 5, 2022

The current design has been like this since the very beginning https://github.com/PyTorchLightning/pytorch-lightning/blob/52f33ac320693331c006060c151b97a1a2eaa57f/pytorch_lightning/models/trainer.py#L126-L127 and I don’t think it will change.

If you need this behavior, you could play with setting trainer.val_check_batch yourself.
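For reference, a minimal sketch of that workaround as a callback, assuming PyTorch Lightning 1.5.x internals (val_check_batch is an undocumented internal attribute, and the ValEveryNOptimizerSteps helper name is hypothetical):

import pytorch_lightning as pl

class ValEveryNOptimizerSteps(pl.Callback):
    # Hypothetical helper: rescale the internal per-batch validation
    # interval so validation effectively runs every N optimizer steps.
    def __init__(self, every_n_optimizer_steps: int):
        self.every_n_optimizer_steps = every_n_optimizer_steps

    def on_train_start(self, trainer, pl_module):
        # trainer.val_check_batch is the internal forward-batch interval
        # derived from val_check_interval; overriding it is the
        # "set it yourself" idea above, not a documented API.
        trainer.val_check_batch = (
            self.every_n_optimizer_steps * trainer.accumulate_grad_batches
        )

trainer = pl.Trainer(
    accumulate_grad_batches=8,
    callbacks=[ValEveryNOptimizerSteps(every_n_optimizer_steps=100)],
)

Whether on_train_start fires after the trainer computes val_check_batch may vary by version, so treat this as a starting point to adapt rather than a guaranteed fix.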
