BucketIterator iterating more than the length
I have code like this:
batch_size = 100
train_dataset, test_dataset = convos.split(split_ratio=0.7)
train_iterator = torchtext.data.BucketIterator(
    train_dataset,
    batch_size=3000,
    sort_key=lambda x: torchtext.data.interleave_keys(len(x.context), len(x.response)),
    device=device
)
print("Batch size: ", batch_size)
print("Train size: ", len(train_iterator))
Batch size: 100
Train size: 1
Following this, I do
for count, batch in enumerate(train_iterator):
    print(count)
And this program doesn’t stop (at least I checked until count = 3000). I suppose it should iterate just once!
Issue Analytics
- State:
- Created 5 years ago
- Comments:7 (5 by maintainers)
Top GitHub Comments
Yeah, I believe it is more or less agreed that the default infinite loop in Iterators is confusing.
I’d say that the expected behavior is to iterate for one epoch (so the default would be repeat=False), and leave the infinite loop as an option if users prefer it. This might break backward compatibility, but it’s manageable.
I will just close this issue since a PR is attached. Thanks all for the help.
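For anyone landing here later, the behavior discussed above can be illustrated with a small pure-Python sketch. ToyIterator is a hypothetical stand-in written for this example, not torchtext's implementation; it only assumes that the length is the number of batches per epoch and that a repeat flag decides whether iteration restarts after each epoch. With the real BucketIterator, the equivalent would presumably be passing repeat=False as suggested in the comment above, or capping the loop at len(train_iterator) batches.

```python
import itertools
import math


class ToyIterator:
    """Hypothetical stand-in for a batch iterator with a `repeat` flag."""

    def __init__(self, data, batch_size, repeat=True):
        self.data = list(data)
        self.batch_size = batch_size
        self.repeat = repeat

    def __len__(self):
        # Number of batches in ONE epoch, not the total number yielded.
        return math.ceil(len(self.data) / self.batch_size)

    def __iter__(self):
        while True:
            for start in range(0, len(self.data), self.batch_size):
                yield self.data[start:start + self.batch_size]
            if not self.repeat:
                # Stop after a single pass over the data.
                return


examples = range(250)  # made-up dataset size, smaller than the batch size

# Option 1: disable repetition so the loop ends after one epoch.
one_epoch = ToyIterator(examples, batch_size=3000, repeat=False)
one_epoch_batches = sum(1 for _ in one_epoch)

# Option 2: keep repetition but cap the loop at one epoch's worth of batches.
endless = ToyIterator(examples, batch_size=3000, repeat=True)
capped_batches = sum(1 for _ in itertools.islice(endless, len(endless)))

print(len(one_epoch), one_epoch_batches, capped_batches)
```

In this sketch the length is 1 because all 250 examples fit into a single batch of 3000, which matches the "Train size: 1" in the report, while the repeating variant keeps yielding that one batch until the caller stops it.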