Improve the example of Timer's usage

📚 Documentation

The example in the `Timer` documentation suggests estimating the processing time for a batch as follows:

timer.attach(
    trainer, 
    start=Events.EPOCH_STARTED, 
    resume=Events.ITERATION_STARTED, 
    pause=Events.ITERATION_COMPLETED, 
    step=Events.ITERATION_COMPLETED)

Note that this timer is reset at the start of each epoch. When the data loader uses multiple workers, the first few iterations of each epoch often take longer than the later ones. Because of the reset, the estimate of the remaining training time (ETA) becomes inaccurate, even when the average flag is set to True. Specifically, the ETA is computed as (engine.state.max_iters - engine.state.iteration) * time_per_iter, so even a small fluctuation in time_per_iter is magnified by the number of remaining iterations. To address this problem, we can let the timer start only once in the whole training process:

timer.attach(
    trainer, 
    start=Events.EPOCH_STARTED(once=1), 
    resume=Events.ITERATION_STARTED, 
    pause=Events.ITERATION_COMPLETED, 
    step=Events.ITERATION_COMPLETED)

I have empirically verified the effectiveness of this modification.
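
To make the effect concrete, here is a minimal pure-Python sketch of the two behaviours. It does not use ignite; it only mimics an averaging timer as elapsed time divided by step count, which is my assumption about how the average is formed. The function name `eta_estimates` and the iteration times are hypothetical, chosen so that the first iteration of each epoch is slow.

```python
def eta_estimates(iter_times, epoch_length, reset_each_epoch):
    """Return the ETA in seconds computed after every iteration.

    iter_times: per-iteration durations in seconds (made-up data).
    reset_each_epoch: True mimics start=Events.EPOCH_STARTED,
                      False mimics start=Events.EPOCH_STARTED(once=1).
    """
    total_iters = len(iter_times)
    elapsed, steps = 0.0, 0
    etas = []
    for i, duration in enumerate(iter_times):
        at_epoch_start = i % epoch_length == 0
        if at_epoch_start and (reset_each_epoch or i == 0):
            # The "start" event resets the accumulated time and step count.
            elapsed, steps = 0.0, 0
        elapsed += duration
        steps += 1
        time_per_iter = elapsed / steps  # running average, as with average=True
        remaining = total_iters - (i + 1)
        etas.append(remaining * time_per_iter)
    return etas


# Hypothetical workload: 3 epochs of 5 iterations each; the first
# iteration of every epoch takes 5 s (worker start-up), the rest 1 s.
epoch_length = 5
iter_times = ([5.0] + [1.0] * 4) * 3

with_reset = eta_estimates(iter_times, epoch_length, reset_each_epoch=True)
without_reset = eta_estimates(iter_times, epoch_length, reset_each_epoch=False)

# Compare the ETA right after the first iteration of the second epoch.
first_iter_epoch_2 = epoch_length
print("ETA with per-epoch reset:", with_reset[first_iter_epoch_2])
print("ETA with a single start: ", without_reset[first_iter_epoch_2])
print("Actual remaining time:   ", sum(iter_times[first_iter_epoch_2 + 1:]))
```

With the reset, the average right after an epoch boundary is based on a single slow iteration, so the ETA jumps. With a single start, the slow iteration is averaged with all earlier ones and the estimate stays closer to the actual remaining time.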

Issue Analytics

State:
Created 2 years ago
Comments: 5 (2 by maintainers)

Top GitHub Comments

1 reaction

sandylaker commented, Aug 21, 2021

@vfdev-5 It also works. Please see the following log:

[0]<stderr>:2021-08-21 23:33:50,758 ignite INFO: Epoch [1/100] [10/50]: batch time: 3.13489; eta: 4:20:43; lr: 4.578e-06; train loss: 2.40777
[0]<stderr>:2021-08-21 23:34:16,061 ignite INFO: Epoch [1/100] [20/50]: batch time: 2.76993; eta: 3:49:54; lr: 8.554e-06; train loss: 2.15799
[0]<stderr>:2021-08-21 23:34:45,482 ignite INFO: Epoch [1/100] [30/50]: batch time: 2.34365; eta: 3:14:07; lr: 1.253e-05; train loss: 2.26927
[0]<stderr>:2021-08-21 23:35:11,477 ignite INFO: Epoch [1/100] [40/50]: batch time: 2.39976; eta: 3:18:22; lr: 1.651e-05; train loss: 2.03384
[0]<stderr>:2021-08-21 23:35:25,625 ignite INFO: Epoch [1/100] [0/50]: batch time: 2.10667; eta: 2:53:48; lr: 2.048e-05; train loss: 1.02629
[0]<stderr>:2021-08-21 23:35:57,737 ignite INFO: Epoch [1/100]: validation epoch time: 0:00:31; val_loss: 1.90737; accuracy: 0.43252; precision: 0.11525; recall: 0.[0]<stderr>:
[0]<stderr>:2021-08-21 23:36:26,375 ignite INFO: Epoch [2/100] [10/50]: batch time: 2.08878; eta: 2:51:58; lr: 2.446e-05; train loss: 1.73261
[0]<stderr>:2021-08-21 23:36:46,239 ignite INFO: Epoch [2/100] [20/50]: batch time: 1.94054; eta: 2:39:26; lr: 2.843e-05; train loss: 1.80995
[0]<stderr>:2021-08-21 23:37:06,633 ignite INFO: Epoch [2/100] [30/50]: batch time: 1.79434; eta: 2:27:08; lr: 3.241e-05; train loss: 1.76788

0 reactions

vfdev-5 commented, Aug 22, 2021

@sandylaker thanks! Would you like to send a PR with a fix?