Issue of reusing epoch_iter introduced by the latest commit
@ngoyal2707 @myleott At this line from this commit, self.epoch_iter is not updated when it is already set, i.e., the same iterator is returned across epochs. However, the epoch iterator needs a different seed in each epoch for shuffling; otherwise we get the same batch order in every epoch. So IMO returning the same iterator across epochs is not a good idea in any case.
PS: This also makes the validation log start printing only from epoch 2. I think that is because the training iterator is even reused for validation when self.epoch_iter is reused like that.
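For illustration, here is a minimal sketch (not fairseq's actual implementation) of the difference: deriving the shuffle from a per-epoch seed gives a new, reproducible order each epoch, while reusing one cached iterator replays the same batch order every epoch.

```python
# Minimal sketch (not fairseq's actual code) of per-epoch shuffling:
# deriving the order from (seed, epoch) gives a new, reproducible order
# each epoch, while reusing one cached order repeats it every epoch.
import numpy as np

def batch_order(num_batches, seed, epoch):
    rng = np.random.RandomState(seed + epoch)  # reseed per epoch
    order = np.arange(num_batches)
    rng.shuffle(order)
    return order

print(batch_order(8, seed=1, epoch=1))  # order for epoch 1
print(batch_order(8, seed=1, epoch=2))  # a different order for epoch 2

# What the cached-iterator behavior amounts to: the epoch-1 order is
# simply replayed for every later epoch.
cached = batch_order(8, seed=1, epoch=1)
print(cached)  # epoch 2 (and 3, 4, ...) would see this exact order again
```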
Issue Analytics
- State:
- Created 4 years ago
- Comments: 8 (8 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
yeah, the reason to change this was similar to what you mentioned: we were seeing slightly different results between the single- and multi-GPU setups because of this extra rand() call in the multi-GPU case. I don't think the right solution here is to keep the rand() call. Do you see a big difference in your experiments with and without it? Worst case, I'd suggest adding that rand() call in your task __init__? (cc: @myleott)
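For reference, a minimal sketch of the divergence being discussed (the names below are illustrative, not fairseq's code): an extra torch.rand() call on one code path advances the global RNG, so everything drawn afterwards differs from a path that skips the call.

```python
import torch

def later_draws(seed, extra_rand_call):
    torch.manual_seed(seed)
    if extra_rand_call:
        torch.rand(1)  # e.g. a call that only happens on the multi-GPU path
    return torch.rand(3)  # any later sampling (dropout masks, init, ...) now differs

print(later_draws(0, extra_rand_call=False))
print(later_draws(0, extra_rand_call=True))

# The workaround suggested above would amount to issuing the same rand()
# call unconditionally (e.g. in the task's __init__, hypothetical placement)
# so that single-GPU and multi-GPU setups consume the RNG identically.
```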
@ngoyal2707 OK, I found the reason why my previous experiment was not reproducible: this commit replaced torch.rand() with torch.zeros(), which changes the state of the RNG during multi-GPU training. It may sound strange, but is it possible to revert back to torch.rand() so that all the previous experiments are reproducible?