Improve the example of Timer's usage
📚 Documentation
The example in the `Timer` documentation suggests estimating the processing time for a batch as follows:
timer.attach(
    trainer,
    start=Events.EPOCH_STARTED,
    resume=Events.ITERATION_STARTED,
    pause=Events.ITERATION_COMPLETED,
    step=Events.ITERATION_COMPLETED)
Note that this timer is reset at the start of each epoch. When the data loader uses multiple workers, the first few iterations of each epoch often take longer than the later ones. The reset therefore makes the estimate of the remaining training time (ETA) inaccurate, even when the `average` flag is set to `True`. Specifically, the ETA is computed as `(engine.state.max_iters - engine.state.iteration) * time_per_iter`, so a small fluctuation in `time_per_iter` is magnified by the number of remaining iterations. To address this problem, we can let the timer start only once in the whole training process:
timer.attach(
    trainer,
    start=Events.EPOCH_STARTED(once=1),
    resume=Events.ITERATION_STARTED,
    pause=Events.ITERATION_COMPLETED,
    step=Events.ITERATION_COMPLETED)
I have empirically verified the effectiveness of this modification.
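To illustrate why the per-epoch reset hurts the ETA, here is a small pure-Python sketch that does not use ignite. It models an averaging timer as `total / step_count` and compares the ETA with and without a reset at each epoch start. The helper `eta_trace` and the timing values (two slow 1.0 s iterations at the start of each epoch, then 0.1 s each) are hypothetical and chosen only for illustration; they are not measurements.

```python
def eta_trace(iter_times, iters_per_epoch, reset_each_epoch):
    """Return the ETA estimate after every iteration.

    Simplified model of an averaging timer: time_per_iter = total / steps.
    If reset_each_epoch is True, total and steps are cleared at every epoch start.
    """
    max_iters = len(iter_times)
    total, steps = 0.0, 0
    etas = []
    for iteration, elapsed in enumerate(iter_times, start=1):
        # Clear the running average at the first iteration of each epoch.
        if reset_each_epoch and (iteration - 1) % iters_per_epoch == 0:
            total, steps = 0.0, 0
        total += elapsed
        steps += 1
        time_per_iter = total / steps
        etas.append((max_iters - iteration) * time_per_iter)
    return etas


# Hypothetical timings: the first two iterations of each epoch are slow.
iters_per_epoch, num_epochs = 10, 5
iter_times = ([1.0, 1.0] + [0.1] * (iters_per_epoch - 2)) * num_epochs

with_reset = eta_trace(iter_times, iters_per_epoch, reset_each_epoch=True)
without_reset = eta_trace(iter_times, iters_per_epoch, reset_each_epoch=False)

# Ground truth: the time actually remaining after each iteration.
true_remaining = [sum(iter_times[i:]) for i in range(1, len(iter_times) + 1)]

# Compare the estimates right after the first iteration of the second epoch.
idx = iters_per_epoch  # zero-based index of iteration 11
print("ETA with reset:   ", with_reset[idx])
print("ETA without reset:", without_reset[idx])
print("true remaining:   ", true_remaining[idx])
```

In this toy setting, the estimate taken just after an epoch boundary is dominated by the single slow iteration when the timer has been reset, whereas the non-reset timer averages over all iterations seen so far and stays much closer to the true remaining time.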
Issue Analytics
- Created 2 years ago
- Comments:5 (2 by maintainers)
@vfdev-5 It also works. Please see the following log:
@sandylaker thanks! Would you like to send a PR with a fix?