Unexpected `_run_once_on_dataset` behavior when using a custom dataloader that implements a `__len__` method
🐛 Bug description
Context - I was calling `engine.run(my_run_function)` with a custom dataloader that implements the `__len__` method.
What happened - When my run hits `_run_once_on_dataset`, the expected behavior is to break out of the while loop once the dataloader raises `StopIteration`. Based on other discussions, we expect `self.state.epoch_length` to be `None` and therefore hit the break from there. However, during earlier steps `self.state.epoch_length` gets set to a positive integer, because `self.state.max_epochs` is `None` for a new run (instead of being read from a `state_dict`) and my dataloader defines `__len__`, even though I did not pass the `epoch_length` parameter to `run`.
What I expect - Upon `StopIteration`, the dataloader is treated as exhausted for the epoch, `_run_once_on_dataset` exits without reinitializing the dataloader, and the outer while loop is re-entered.
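For illustration, here is a minimal, hypothetical sketch of this kind of setup (the class and function names are invented here, not taken from my actual code): a custom dataloader that defines `__len__` and is passed to `Engine.run` without an explicit `epoch_length`.

```python
import torch
from ignite.engine import Engine

class MyDataLoader:
    """Hypothetical custom dataloader: iterable and defines __len__."""

    def __init__(self, num_batches=5, batch_size=4):
        self.num_batches = num_batches
        self.batch_size = batch_size

    def __iter__(self):
        for _ in range(self.num_batches):
            yield torch.rand(self.batch_size, 3, 32, 32)

    def __len__(self):
        # Because __len__ exists, the engine derives epoch_length from it
        # instead of relying on StopIteration to end the epoch.
        return self.num_batches

def train_step(engine, batch):
    return batch.mean().item()

trainer = Engine(train_step)
trainer.run(MyDataLoader(), max_epochs=2)  # no epoch_length passed
```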
Environment
- PyTorch Version (e.g., 1.4): 1.10
- Ignite Version (e.g., 0.3.0): 0.4.9
- OS (e.g., Linux): macOS
- How you installed Ignite (conda, pip, source): pip
- Python version: 3.9
- Any other relevant information:
Explicitly converting the data into an iterator, like `iter(train_data)`, has an impact to keep in mind: if we need to run more than one epoch, we have to restart the iterator manually, as in the sketch below. Source: https://pytorch-ignite.ai/how-to-guides/06-data-iterator/
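Roughly, the manual restart looks like the following sketch, reconstructed here with illustrative names and sizes; treat the linked guide itself as the reference for the exact code.

```python
import torch
from ignite.engine import Engine, Events

size = 10  # illustrative number of batches per epoch

def make_data_iter():
    # A plain finite iterator: no __len__, so the engine cannot infer
    # the epoch length from it.
    for _ in range(size):
        yield torch.rand(4, 3, 32, 32)

def train_step(engine, batch):
    return batch.mean().item()

trainer = Engine(train_step)

# Restart the iterator by hand once it has been consumed; otherwise the
# next epoch would start from an already-exhausted iterator.
@trainer.on(Events.ITERATION_COMPLETED(every=size))
def restart_iter():
    trainer.state.dataloader = make_data_iter()

trainer.run(make_data_iter(), epoch_length=size, max_epochs=3)
```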
Another solution (which it seems you have already tried) could be to explicitly specify the data size with the `epoch_length` argument, if we know it and the value is reliable; a minimal sketch follows below.
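For completeness, here is a minimal sketch of passing `epoch_length` explicitly (illustrative names and values; it assumes the per-epoch batch count is known and reliable):

```python
import torch
from ignite.engine import Engine

def data_iter():
    # Finite iterator without __len__ (illustrative).
    for _ in range(8):
        yield torch.rand(4, 3, 32, 32)

def train_step(engine, batch):
    return batch.mean().item()

trainer = Engine(train_step)

# Tell the engine the number of batches up front; each epoch then ends
# after `epoch_length` iterations rather than being derived from __len__.
trainer.run(data_iter(), epoch_length=8, max_epochs=1)
```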
Appreciate the tip! Discarding the dataloader size fixed my use case, since it then effectively relied on `StopIteration`. QQ - Are there any other impacts to keep in mind when doing so?

FWIW, I'd love to make a snippet for us to reproduce the error, but my data loading and training are async, and I find it challenging given the complexity. In essence, after loading the last partition, not all batches have gone through training. That's why, even when `__len__` is correctly set to the total number of batches, the engine will restart the dataloader.

Thanks for following along, happy to close the issue!