[RFC] Disable the multiple Iterators per IterDataPipe (Make Iterator singleton)

This is the initial draft. I will complete it shortly.

The state of the Iterator would be attached to each IterDataPipe instance. This is super useful for:

Determinism
Snapshotting
Benchmarking -> It becomes easier to register each DataPipe, since each one has a distinct ID in the graph.

Implementation Options:

Each DataPipe has an _iterator attribute as the placeholder for __iter__ calls.
Implement __next__. (My Preference)
- It would make the instance picklable. Previously, the generator function (__iter__) was not picklable, so this helps multiprocessing and snapshotting.
- __iter__ returns self (Forker(self) may be another option, not 100% sure).
- IMO, this is super useful, as we can track the number of __next__ calls to do a fast-forward. The state of iteration is attached to the DataPipe instance rather than to a temporary instance created by __iter__, whose internal state we couldn't track. (We can easily track states like RNG, iteration number, buffer, etc., as they are going to be attached to the self instance.)
- The source DataPipe is attached to each DataPipe, but the actual iteration happens at the Iterator level, so the graph constructed by DataLoaderV2 doesn't match the actual execution graph.
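
To make the __next__ option above concrete, here is a minimal pure-Python sketch. It does not use torchdata; `CountingPipe`, `num_yielded`, and `generator_pipe` are hypothetical names chosen for illustration, not part of any real API. It only shows the general idea: when __iter__ returns self, the iteration state is ordinary instance data that can be pickled and restored, whereas a generator object cannot be pickled.

```python
import pickle


class CountingPipe:
    """Hypothetical stand-in for an IterDataPipe that implements __next__."""

    def __init__(self, source):
        self.source = list(source)  # plain, picklable data
        self.num_yielded = 0        # iteration state lives on the instance

    def __iter__(self):
        # Singleton iterator: the pipe is its own iterator.
        return self

    def __next__(self):
        if self.num_yielded >= len(self.source):
            raise StopIteration
        item = self.source[self.num_yielded]
        self.num_yielded += 1
        return item


def generator_pipe(source):
    """The generator style: its state is hidden inside the generator object."""
    for item in source:
        yield item


pipe = CountingPipe(range(5))
first_two = [next(pipe), next(pipe)]

# Snapshot mid-iteration: the state is just the instance attributes.
state = pickle.dumps(vars(pipe))

# Restore into a fresh instance and continue where the snapshot left off.
restored = CountingPipe([])
vars(restored).update(pickle.loads(state))
remaining = list(restored)

# A generator object, by contrast, cannot be pickled.
try:
    pickle.dumps(generator_pipe(range(5)))
    generator_picklable = True
except TypeError:
    generator_picklable = False
```

Because `num_yielded` is tracked on the instance, a fast-forward after restart amounts to restoring that counter (or calling `next` that many times) instead of guessing at the hidden state of a temporary iterator.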

DataLoader triggers an error if there are two DataPipe instances with the same ID in the graph. (Another option is for DataLoader to fork automatically.) Users should use Forker for each DataPipe that they want to appear twice in the graph.
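
As a rough illustration of the check described above, the sketch below walks a toy graph and reports any pipe instance that is reachable through more than one edge. `Node`, `sources`, and `find_shared_pipes` are hypothetical names; the real DataLoader graph traversal is not shown in this issue and may work differently.

```python
class Node:
    """Hypothetical minimal DataPipe: it only holds references to its sources."""

    def __init__(self, name, *sources):
        self.name = name
        self.sources = sources


def find_shared_pipes(root):
    """Return the names of pipe instances that are consumed more than once."""
    seen = set()   # id() of every instance already reached
    shared = []    # names of instances reached a second time
    stack = [root]
    while stack:
        node = stack.pop()
        for src in node.sources:
            if id(src) in seen:
                if src.name not in shared:
                    shared.append(src.name)
            else:
                seen.add(id(src))
                stack.append(src)
    return shared


def check_graph(root):
    """Raise if the same instance appears twice (assumed error behavior)."""
    shared = find_shared_pipes(root)
    if shared:
        raise RuntimeError(f"DataPipe(s) used more than once without forking: {shared}")


# The same source instance feeds two branches that are zipped back together.
source = Node("source")
branch_a = Node("map_a", source)
branch_b = Node("map_b", source)
zipped = Node("zip", branch_a, branch_b)

# A plain linear chain has no shared instance.
linear = Node("batch", Node("map", Node("source")))

shared_in_zipped = find_shared_pipes(zipped)
shared_in_linear = find_shared_pipes(linear)
```

With a singleton iterator, the two branches of `zipped` would compete for the one iterator on `source`, which is why the RFC asks for either an error or an explicit fork in this situation.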

cc: @VitalyFedyunin @NivekT

Top GitHub Comments

NivekT commented, Jun 10, 2022

Note to future readers: the linked PRs contain BC-breaking notes that describe the behavior before and after in detail.

ejguan commented, Apr 20, 2022

Keep the existing DataPipes implementations as they are, but we will add checks to invalidate an old iterator when a new one is created (i.e. always be in singleton mode)

For the PR, could you verify all our existing customers’ code would behave normally?

2. Update the documentation to recommend that users return self within __iter__ for custom IterDataPipe

Just ignore my argument above. I misunderstood the approach in the PR. We can leave the API as it is. But, we need to document the behavior of the singleton iterator.
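
The approach quoted in this comment (keep generator-style __iter__, but invalidate an old iterator when a new one is created) could look roughly like the sketch below. It is an assumption-laden illustration in plain Python, not the code from the PR: `SingletonIterPipe`, `_generation`, and `_guarded` are hypothetical names.

```python
class SingletonIterPipe:
    """Hypothetical pipe: __iter__ stays a generator, but only the newest
    iterator is allowed to keep producing items."""

    def __init__(self, source):
        self.source = source
        self._generation = 0  # bumped every time a new iterator is requested

    def __iter__(self):
        # Creating a new iterator invalidates every earlier one.
        self._generation += 1
        return self._guarded(self._generation)

    def _guarded(self, generation):
        for item in self.source:
            if generation != self._generation:
                raise RuntimeError(
                    "This iterator was invalidated because a newer one was created."
                )
            yield item


pipe = SingletonIterPipe([10, 20, 30])

it1 = iter(pipe)
first_from_it1 = next(it1)

it2 = iter(pipe)  # from here on, it1 is stale

try:
    next(it1)
    old_iterator_failed = False
except RuntimeError:
    old_iterator_failed = True

first_from_it2 = next(it2)
```

Existing DataPipe implementations would not need to change under this scheme, which matches the comment's point that the API can stay as it is as long as the singleton behavior is documented.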