`val_check_interval` (int) is not `global_steps` when `accumulate_grad_batches` > 1
🐛 Bug
From the `val_check_interval` docs:
- use (int) to check every n steps (batches)
But when `accumulate_grad_batches` > 1, `val_check_interval` refers to total forward steps (batches).
This is weird, because the number of model updates is the number of backward steps (i.e. `global_step`), which equals total forward steps / `accumulate_grad_batches`.
By contrast, `log_every_n_steps` works as expected: it refers to backward steps (i.e. `global_step`).
To Reproduce
import pytorch_lightning as pl

trainer = pl.Trainer(
    # ...
    accumulate_grad_batches=8,
    val_check_interval=100,
    # ...
)
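To make the mismatch concrete, here is a small arithmetic sketch using the numbers from the config above (plain Python, not Lightning internals):

```python
accumulate_grad_batches = 8
val_check_interval = 100

# Observed behavior: validation runs every 100 forward/batch steps,
# which is only 100 / 8 = 12.5 optimizer (global) steps.
optimizer_steps_between_vals = val_check_interval / accumulate_grad_batches

# Expected behavior (if the int counted global steps, like log_every_n_steps):
# validation every 100 optimizer steps, i.e. every 100 * 8 = 800 batches.
batches_between_vals = val_check_interval * accumulate_grad_batches

print(optimizer_steps_between_vals)  # 12.5
print(batches_between_vals)          # 800
```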
Environment
- CUDA:
  - GPU:
    - A100-SXM4-40GB
    - A100-SXM4-40GB
    - A100-SXM4-40GB
    - A100-SXM4-40GB
    - A100-SXM4-40GB
    - A100-SXM4-40GB
    - A100-SXM4-40GB
    - A100-SXM4-40GB
  - available: True
  - version: 11.3
- Packages:
  - numpy: 1.20.1
  - pyTorch_debug: False
  - pyTorch_version: 1.10.2+cu113
  - pytorch-lightning: 1.5.10
  - tqdm: 4.63.0
- System:
  - OS: Linux
  - architecture: 64bit
  - processor: x86_64
  - python: 3.7.10
  - version: #1 SMP Fri Mar 19 10:07:22 CST 2021
cc @borda @carmocca @justusschock @ananthsub @ninginthecloud @rohitgr7 @kaushikb11 @awaelchli

Doesn’t this seem strange?
The final checkpoints are:
The current design has been like this since the very beginning (https://github.com/PyTorchLightning/pytorch-lightning/blob/52f33ac320693331c006060c151b97a1a2eaa57f/pytorch_lightning/models/trainer.py#L126-L127), and I don’t think it will change.
If you need this behavior, you could play with setting `trainer.val_check_batch` yourself.
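A rough sketch of that workaround (illustrative only, assuming Lightning 1.5.x internals: `val_check_batch` is an internal attribute that the trainer recomputes when the train dataloader is set up, so the hook timing may differ across versions; the callback name and constants are hypothetical):

```python
import pytorch_lightning as pl
from pytorch_lightning.callbacks import Callback

ACCUMULATE_GRAD_BATCHES = 8
VAL_EVERY_N_OPTIMIZER_STEPS = 100  # what we actually want


class ValCheckOnOptimizerSteps(Callback):
    """Illustrative callback: rescale the internal `val_check_batch` so that
    validation fires every N optimizer steps instead of every N batches."""

    def on_train_epoch_start(self, trainer, pl_module):
        # `val_check_interval` (int) is interpreted as batches, so convert
        # optimizer steps -> batches by multiplying by the accumulation factor.
        # Done per epoch in case the trainer recomputes val_check_batch when
        # dataloaders are reloaded.
        trainer.val_check_batch = (
            VAL_EVERY_N_OPTIMIZER_STEPS * ACCUMULATE_GRAD_BATCHES
        )


trainer = pl.Trainer(
    accumulate_grad_batches=ACCUMULATE_GRAD_BATCHES,
    val_check_interval=VAL_EVERY_N_OPTIMIZER_STEPS,
    callbacks=[ValCheckOnOptimizerSteps()],
)
```

Note that this assumes each training epoch contains at least `VAL_EVERY_N_OPTIMIZER_STEPS * ACCUMULATE_GRAD_BATCHES` batches; otherwise validation would never trigger mid-epoch.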