
`val_check_interval` (int) is not `global_steps` when `accumulate_grad_batches` > 1


🐛 Bug

val_check_interval doc:

  • use (int) to check every n steps (batches)

But when accumulate_grad_batches > 1, val_check_interval refers to total forward steps (batches). This is weird, because the number of model updates is the number of backward steps (i.e. global_steps), which equals total forward steps / accumulate_grad_batches. By contrast, log_every_n_steps works correctly: it refers to total backward steps (i.e. global_steps).

To Reproduce

import pytorch_lightning as pl

trainer = pl.Trainer(
    # ...
    accumulate_grad_batches=8,   # one optimizer step per 8 forward batches
    val_check_interval=100,      # counts forward batches, not optimizer steps
    # ...
)
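To make the mismatch concrete, here is a small arithmetic sketch (plain Python counting, not Lightning code; the numbers come from the config above):

accumulate_grad_batches = 8
val_check_interval = 100

# What val_check_interval actually counts: forward batches.
# Validation therefore runs after 100 batches, which is only
# 100 / 8 = 12.5 optimizer steps (global_steps).
steps_per_validation = val_check_interval / accumulate_grad_batches  # 12.5

# What the reporter expected: validation every 100 optimizer steps,
# which corresponds to 100 * 8 = 800 forward batches.
batches_per_expected_validation = val_check_interval * accumulate_grad_batches  # 800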

Environment

  • CUDA:
    - GPU: 8× A100-SXM4-40GB
    - available: True
    - version: 11.3
  • Packages:
    - numpy: 1.20.1
    - pyTorch_debug: False
    - pyTorch_version: 1.10.2+cu113
    - pytorch-lightning: 1.5.10
    - tqdm: 4.63.0
  • System:
    - OS: Linux
    - architecture: 64bit
    - processor: x86_64
    - python: 3.7.10
    - version: #1 SMP Fri Mar 19 10:07:22 CST 2021

cc @borda @carmocca @justusschock @ananthsub @ninginthecloud @rohitgr7 @kaushikb11 @awaelchli

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

4 reactions
ShaneTian commented, Mar 6, 2022

Doesn’t this seem strange?

import pytorch_lightning as pl

val_ckpt_callback = pl.callbacks.ModelCheckpoint(
    filename="val_end-{epoch}-{step}-{val_loss:.2f}",
    save_top_k=-1,      # keep every checkpoint instead of only the best k
    every_n_epochs=1,
)
trainer = pl.Trainer(
    # ...
    callbacks=[val_ckpt_callback],
    accumulate_grad_batches=8,
    val_check_interval=100,
    # ...
)

The final checkpoints are (note that {step} in the filename reflects optimizer steps, not the forward batches that val_check_interval counts):

val_end-epoch=0-step=24-val_loss=3.11.ckpt
val_end-epoch=0-step=49-val_loss=2.98.ckpt
val_end-epoch=0-step=74-val_loss=2.45.ckpt
val_end-epoch=0-step=99-val_loss=2.02.ckpt

3 reactions
carmocca commented, Mar 5, 2022

The current design has been like this since the very beginning https://github.com/PyTorchLightning/pytorch-lightning/blob/52f33ac320693331c006060c151b97a1a2eaa57f/pytorch_lightning/models/trainer.py#L126-L127 and I don’t think it will change.

If you need this behavior, you could play with setting trainer.val_check_batch yourself.
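For reference, a minimal sketch of that workaround as a callback, assuming PyTorch Lightning 1.5.x internals (val_check_batch is an undocumented internal attribute, and the ValEveryNOptimizerSteps helper name is hypothetical):

import pytorch_lightning as pl

class ValEveryNOptimizerSteps(pl.Callback):
    # Hypothetical helper: rescale the internal per-batch validation
    # interval so validation effectively runs every N optimizer steps.
    def __init__(self, every_n_optimizer_steps: int):
        self.every_n_optimizer_steps = every_n_optimizer_steps

    def on_train_start(self, trainer, pl_module):
        # trainer.val_check_batch is the internal forward-batch interval
        # derived from val_check_interval; overriding it is the
        # "set it yourself" idea above, not a documented API.
        trainer.val_check_batch = (
            self.every_n_optimizer_steps * trainer.accumulate_grad_batches
        )

trainer = pl.Trainer(
    accumulate_grad_batches=8,
    callbacks=[ValEveryNOptimizerSteps(every_n_optimizer_steps=100)],
)

Whether on_train_start fires after the trainer computes val_check_batch may vary by version, so treat this as a starting point to adapt rather than a guaranteed fix.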
