
BucketIterator iterating more than the length

See original GitHub issue

I have code like this:

batch_size = 100

train_dataset, test_dataset = convos.split(split_ratio=0.7)

train_iterator = torchtext.data.BucketIterator(
    train_dataset,
    batch_size=3000,
    sort_key=lambda x: torchtext.data.interleave_keys(len(x.context), len(x.response)),
    device=device
)

print("Batch size: ", batch_size)
print("Train size: ", len(train_iterator))

This prints:

Batch size:  100
Train size:  1

Following this, I do

for count, batch in enumerate(train_iterator):
    print(count)

And this program doesn’t stop (I checked at least until count = 3000). I would expect it to iterate just once!
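The behavior reported above matches how the old (legacy) torchtext Iterator could cycle over the data indefinitely when its repeat option was enabled. The snippet below is a toy stand-in, not the real torchtext code, illustrating why `len(iterator)` can be 1 while the `for` loop never terminates, along with the manual-break workaround users relied on:

```python
import math

class RepeatingIterator:
    """Toy illustration of an iterator with torchtext-style `repeat`
    semantics (a sketch, not the actual torchtext implementation):
    with repeat=True, iteration cycles over the data forever."""

    def __init__(self, data, batch_size, repeat=True):
        self.data = data
        self.batch_size = batch_size
        self.repeat = repeat

    def __len__(self):
        # Number of batches in a single epoch.
        return math.ceil(len(self.data) / self.batch_size)

    def __iter__(self):
        while True:
            for i in range(0, len(self.data), self.batch_size):
                yield self.data[i:i + self.batch_size]
            if not self.repeat:
                return

it = RepeatingIterator(list(range(10)), batch_size=10)
print(len(it))  # 1 -- analogous to "Train size: 1" above

# Workaround with repeating iterators: break manually after one epoch.
batches = []
for count, batch in enumerate(it):
    if count >= len(it):
        break
    batches.append(batch)
print(len(batches))  # 1
```

Without the manual break, the loop above would run forever, which is exactly the surprise described in the issue.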

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 7 (5 by maintainers)

Top GitHub Comments

4 reactions
mttk commented, Sep 25, 2018

Yeah, I believe it is more or less agreed that the default infinite loop in Iterators is confusing.

I’d say the expected behavior is to iterate for one epoch (so the default would be repeat=False), and to leave the infinite loop as an option for users who prefer it. This might break backwards compatibility, but it’s manageable.
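The repeat=False semantics proposed in this comment can be sketched in a few lines of plain Python (an illustration of the proposed default behavior, not torchtext's implementation): each example is yielded exactly once per epoch, so a plain `for` loop terminates.

```python
def one_epoch_batches(data, batch_size):
    """Yield the data exactly once, in fixed-size batches --
    i.e. the repeat=False behavior proposed above (a sketch)."""
    for i in range(0, len(data), batch_size):
        yield data[i:i + batch_size]

batches = list(one_epoch_batches(list(range(7)), batch_size=3))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

With this as the default, `enumerate(train_iterator)` in the original report would stop after `len(train_iterator)` batches instead of looping forever.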

0 reactions
zhangguanheng66 commented, May 30, 2019

I will just close this issue since a PR is attached. Thanks all for the help.

Read more comments on GitHub >

Top Results From Across the Web

Better Batches with PyTorchText BucketIterator | Medium
Using PyTorch Dataset with PyTorchText Bucket Iterator: Here I implemented a ... substantially more data than previous benchmark datasets.
Read more >
PyTorchText BucketIterator - George Mihaila
The purpose is to use an example text datasets and batch it using PyTorchText with BucketIterator and show how it groups text sequences...
Read more >
BucketIterator not returning batches of correct size
I expect a batch from this iterator to have the shape (batch_size, max_len) , but it appends the entire corpus into 1 tensor...
Read more >
Efficient bucketing implementation - Hugging Face Forums
Which is the most efficient way to create batches with sequences of similar length to minimize padding in HF datasets?
Read more >
torchtext.data - Read the Docs
Every dataset consists of one or more types of data. ... If the field has include_lengths=True, a tensor of lengths will be included...
Read more >
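The bucketing idea the results above discuss (grouping sequences of similar length so batches need minimal padding) can be sketched in a few lines, independent of torchtext or Hugging Face. The helper below is a hypothetical illustration: sort by length, then slice into fixed-size batches.

```python
def bucket_batches(seqs, batch_size):
    """Group sequences of similar length to minimize padding:
    sort by length, then slice into fixed-size batches (a sketch
    of the bucketing technique, not a library implementation)."""
    by_len = sorted(seqs, key=len)
    return [by_len[i:i + batch_size]
            for i in range(0, len(by_len), batch_size)]

seqs = [[1], [2, 3, 4], [5, 6], [7], [8, 9]]
for batch in bucket_batches(seqs, batch_size=2):
    # Within a batch, padding cost per sequence is max_len - len(seq),
    # which sorting by length keeps small.
    print([len(s) for s in batch])
```

Real implementations (like torchtext's BucketIterator with its sort_key, used in the code at the top of this issue) also shuffle between buckets so training order isn't strictly by length, but the core padding-reduction trick is the sort-then-slice step shown here.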
