Unexpected `_run_once_on_dataset` behavior when using a custom dataloader that implements a `__len__` method
🐛 Bug description
Context - I was calling `engine.run(my_run_function)` with a custom dataloader that implements the `__len__` method.
What happened - When my run hits `_run_once_on_dataset`, the expected behavior is to break out of the while loop once the dataloader raises `StopIteration`. Based on other discussions, we expect `self.state.epoch_length` to be `None` and therefore hit the break from there. However, during earlier steps `self.state.epoch_length` gets set to a positive integer, because `self.state.max_epochs` is `None` for a new run (instead of being read from a `state_dict`) and my dataloader defines `__len__`, even though I did not pass the `epoch_length` parameter to `run`.
What I expect - Upon `StopIteration`, the dataloader is treated as exhausted for the epoch, `_run_once_on_dataset` exits without reinitializing the dataloader, and the outer while loop is re-entered.
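For illustration, here is a minimal, hypothetical sketch of this kind of setup (the class and function names are invented here, not taken from my actual code): a custom dataloader that defines `__len__` and is passed to `Engine.run` without an explicit `epoch_length`.

```python
import torch
from ignite.engine import Engine

class MyDataLoader:
    """Hypothetical custom dataloader: iterable and defines __len__."""

    def __init__(self, num_batches=5, batch_size=4):
        self.num_batches = num_batches
        self.batch_size = batch_size

    def __iter__(self):
        for _ in range(self.num_batches):
            yield torch.rand(self.batch_size, 3, 32, 32)

    def __len__(self):
        # Because __len__ exists, the engine derives epoch_length from it
        # instead of relying on StopIteration to end the epoch.
        return self.num_batches

def train_step(engine, batch):
    return batch.mean().item()

trainer = Engine(train_step)
trainer.run(MyDataLoader(), max_epochs=2)  # no epoch_length passed
```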
Environment
- PyTorch Version (e.g., 1.4): 1.10
- Ignite Version (e.g., 0.3.0): 0.4.9
- OS (e.g., Linux): macOS
- How you installed Ignite (conda, pip, source): pip
- Python version: 3.9
- Any other relevant information:
Explicitly converting the data into an iterator, like `iter(train_data)`, has an impact to keep in mind: if we need to run more than one epoch, we have to restart the iterator manually, as in the sketch below. Source: https://pytorch-ignite.ai/how-to-guides/06-data-iterator/
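Roughly, the manual restart looks like the following sketch, reconstructed here with illustrative names and sizes; treat the linked guide itself as the reference for the exact code.

```python
import torch
from ignite.engine import Engine, Events

size = 10  # illustrative number of batches per epoch

def make_data_iter():
    # A plain finite iterator: no __len__, so the engine cannot infer
    # the epoch length from it.
    for _ in range(size):
        yield torch.rand(4, 3, 32, 32)

def train_step(engine, batch):
    return batch.mean().item()

trainer = Engine(train_step)

# Restart the iterator by hand once it has been consumed; otherwise the
# next epoch would start from an already-exhausted iterator.
@trainer.on(Events.ITERATION_COMPLETED(every=size))
def restart_iter():
    trainer.state.dataloader = make_data_iter()

trainer.run(make_data_iter(), epoch_length=size, max_epochs=3)
```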
Another solution (which it seems you have already tried) could be to explicitly specify the data size with the `epoch_length` argument, if we know it and the value is reliable; a minimal sketch follows below.
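For completeness, here is a minimal sketch of passing `epoch_length` explicitly (illustrative names and values; it assumes the per-epoch batch count is known and reliable):

```python
import torch
from ignite.engine import Engine

def data_iter():
    # Finite iterator without __len__ (illustrative).
    for _ in range(8):
        yield torch.rand(4, 3, 32, 32)

def train_step(engine, batch):
    return batch.mean().item()

trainer = Engine(train_step)

# Tell the engine the number of batches up front; each epoch then ends
# after `epoch_length` iterations rather than being derived from __len__.
trainer.run(data_iter(), epoch_length=8, max_epochs=1)
```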
Appreciate the tip! Discarding the dataloader size fixed my use case, since it then effectively relied on `StopIteration`. QQ - Are there any other impacts to keep in mind when doing so?

FWIW, I'd love to make a snippet for us to reproduce the error, but my data loading and training are async, and I find it challenging given the complexity. In essence, after loading the last partition, not all batches have gone through training. That's why, even when `__len__` is correctly set to the total number of batches, the engine will restart the dataloader.

Thanks for following along, happy to close the issue!