[RFC] Disable the multiple Iterators per IterDataPipe (Make Iterator singleton)

This is the initial draft. I will complete it shortly.

The proposal is to attach the state of the Iterator to each IterDataPipe instance. This is super useful for:

  • Determinism
  • Snapshotting
  • Benchmarking: it becomes easier to register each DataPipe, since each one has a different ID in the graph.

Implementation Options:

  • Each DataPipe has an _iterator attribute as the placeholder for __iter__ calls.
  • Implement __next__. (My preference)
    • It would make the instance picklable. Previously, the generator produced by a generator-function __iter__ was not picklable. This helps multiprocessing and snapshotting.
    • __iter__ returns self. (Forker(self) may be another option, not 100% sure.)
    • IMO, this is super useful, as we can track the number of __next__ calls to do a fast forward. The state of iteration is attached to the DataPipe instance, rather than to a temporary instance created from __iter__ whose internal state we couldn’t track. (We can easily track states like RNG, iteration number, buffer, etc., as they are going to be attached to the self instance.)
    • The source DataPipe is attached to each DataPipe, but the actual iteration happens at the Iterator level, so the graph constructed by DataLoaderV2 doesn’t match the actual execution graph.
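
The __next__ option above can be illustrated with a minimal sketch in plain Python. CountingPipe and num_yielded are hypothetical names used only for illustration; this is not the torchdata implementation. It assumes a list-backed source and shows that, once the position is stored on the instance, the state can be snapshotted with pickle and resumed, whereas a generator object cannot be pickled.

```python
import pickle


class CountingPipe:
    """Hypothetical stand-in for an IterDataPipe (not the torchdata API)."""

    def __init__(self, source):
        self.source = list(source)  # picklable source, for illustration only
        self.num_yielded = 0        # iteration state lives on the instance

    def __iter__(self):
        # Singleton iterator: the instance is its own iterator.
        return self

    def __next__(self):
        if self.num_yielded >= len(self.source):
            raise StopIteration
        item = self.source[self.num_yielded]
        self.num_yielded += 1
        return item


pipe = CountingPipe(range(5))
first_two = [next(pipe), next(pipe)]

# Snapshot the instance state (source and position) and restore it elsewhere.
snapshot = pickle.dumps(vars(pipe))
restored = CountingPipe([])
vars(restored).update(pickle.loads(snapshot))
remaining = list(restored)  # resumes after the two items already consumed


def gen_iter(source):
    yield from source


# By contrast, a generator object cannot be pickled.
try:
    pickle.dumps(gen_iter(range(5)))
    generator_picklable = True
except TypeError:
    generator_picklable = False
```

Because the position is just an integer on the instance, a fast forward amounts to setting or replaying num_yielded, which is the kind of tracking the bullet above describes.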

DataLoader triggers an error if there are two DataPipe instances with the same ID in the graph. (Another option is for DataLoader to fork automatically.) Users should use Forker for each DataPipe that they want to appear twice in the graph.
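
As a rough sketch of what such a check could look like, the snippet below walks a toy graph and reports any pipe that is reached from more than one parent. Pipe, sources, and find_shared_pipes are hypothetical names invented for this illustration; the real DataLoader graph traversal may differ.

```python
class Pipe:
    """Hypothetical graph node: a name plus the pipes it reads from."""

    def __init__(self, name, *sources):
        self.name = name
        self.sources = sources


def find_shared_pipes(root):
    """Return the names of pipes reachable from root through more than one edge."""
    seen = set()
    shared = []
    stack = [root]
    while stack:
        pipe = stack.pop()
        for src in pipe.sources:
            if id(src) in seen:
                # Same instance reached again: it appears twice in the graph.
                if src.name not in shared:
                    shared.append(src.name)
            else:
                seen.add(id(src))
                stack.append(src)
    return shared


source = Pipe("source")
map_a = Pipe("map_a", source)
map_b = Pipe("map_b", source)
zipped = Pipe("zip", map_a, map_b)

shared_in_diamond = find_shared_pipes(zipped)
shared_in_chain = find_shared_pipes(Pipe("batch", Pipe("map", Pipe("source"))))
```

In this toy graph, a loader could raise an error when the returned list is non-empty, or insert a fork at each reported pipe.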

cc: @VitalyFedyunin @NivekT

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Reactions: 1
  • Comments: 22 (22 by maintainers)

Top GitHub Comments

1 reaction
NivekT commented, Jun 10, 2022

Note to future readers: these linked PRs contain BC-breaking notes that describe the behavior before and after in detail:

1 reaction
ejguan commented, Apr 20, 2022
  1. Keep the existing DataPipes implementations as they are, but we will add checks to invalidate an old iterator when a new one is created (i.e. always be in singleton mode)

For the PR, could you verify all our existing customers’ code would behave normally?

  2. Update the documentation to recommend that users return self within __iter__ for custom IterDataPipe

Just ignore my argument above. I misunderstood the approach in the PR. We can leave the API as it is. But, we need to document the behavior of the singleton iterator.
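
The invalidation idea in item 1 can be sketched in plain Python as follows. This is only an illustration under assumptions: SingletonIterPipe and _valid_iterator_id are hypothetical names, and the actual checks added in the linked PRs may work differently.

```python
class SingletonIterPipe:
    """Hypothetical pipe that allows only one valid iterator at a time."""

    def __init__(self, source):
        self.source = source
        self._valid_iterator_id = 0  # bumped each time __iter__ is called

    def __iter__(self):
        # Creating a new iterator invalidates every earlier one.
        self._valid_iterator_id += 1
        return self._iterate(self._valid_iterator_id)

    def _iterate(self, my_id):
        for item in self.source:
            if my_id != self._valid_iterator_id:
                raise RuntimeError("This iterator was invalidated by a newer one.")
            yield item


pipe = SingletonIterPipe([10, 20, 30])
it1 = iter(pipe)
first = next(it1)

it2 = iter(pipe)  # it1 is now stale
try:
    next(it1)
    old_iterator_failed = False
except RuntimeError:
    old_iterator_failed = True

from_new = list(it2)  # the newest iterator starts from the beginning
```

Existing generator-style __iter__ implementations keep working under this scheme; only code that interleaves two iterators over the same instance sees the error.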
