How to split `_RawTextIterableDataset`

❓ Questions and Help

I am trying to move away from the legacy API and use the newly provided features. I was doing this:

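import random
import torch
from torchtext import legacy

SEED = 1234  # example value; any fixed integer works
random.seed(SEED)  # random.seed() returns None, so seed first and pass the RNG state explicitly
TEXT = legacy.data.Field(lower=True, batch_first=True)
LABEL = legacy.data.LabelField(dtype=torch.float)
train_data, test_data = legacy.datasets.IMDB.splits(TEXT, LABEL, root='/tmp/imdb/')
train_data, valid_data = train_data.split(split_ratio=0.8, random_state=random.getstate())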

Now I want to split the training data with the new API. How can I do that?

from torchtext.datasets import IMDB
train_iter, test_iter = IMDB(split=('train', 'test'))
# I need to split train_iter into train_iter and valid_iter

I also think that providing more features than just this one would help. Thanks!

Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 7 (6 by maintainers)

Top GitHub Comments

2 reactions
zhangguanheng66 commented, Mar 30, 2021

It’s an iterator, so I don’t think you can split or shuffle it. I think it’s worth adding an option to set the offset, or the starting line. Then, for the valid set, you can start from a different line. cc @cpuhrsch @parmeet
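The offset idea can be approximated today without any library support. The following is a minimal plain-Python sketch, not a torchtext feature: `make_reviews` is a hypothetical stand-in for a call such as `IMDB(split='train')`, and `split_by_offset` is an illustrative helper name. It assumes the source can be re-created cheaply, since a consumed iterator cannot be rewound.

```python
from itertools import islice


def make_reviews():
    """Hypothetical stand-in for IMDB(split='train'): yields (label, text) pairs."""
    for i in range(10):
        yield ("pos" if i % 2 == 0 else "neg", f"review {i}")


def split_by_offset(make_iter, n_train):
    """Build two independent iterators over the same source.

    make_iter must return a fresh iterator on every call, because each
    split consumes its own copy of the stream.
    """
    train_iter = islice(make_iter(), 0, n_train)     # first n_train items
    valid_iter = islice(make_iter(), n_train, None)  # everything after the offset
    return train_iter, valid_iter


train_iter, valid_iter = split_by_offset(make_reviews, n_train=8)
train_items = list(train_iter)
valid_items = list(valid_iter)
```

Note that a contiguous split like this only gives a representative validation set if the source is not ordered by label; if it is, the tail of the stream will be skewed toward one class.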

1 reaction
ejguan commented, Jun 24, 2022

It depends on how you want to split. For a simple case, you can use `demux` to split based on the indices generated by enumerating the prior DataPipe.
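The routing logic behind that suggestion can be sketched in plain Python. This is not the torchdata implementation: `route_by_index` and `every_fifth_to_valid` are hypothetical names, and the sample data is made up. In torchdata, the classifier function would be passed to `demux` (as `classifier_fn`, together with `num_instances`) on an enumerated DataPipe; here the same index-based decision is applied in a single pass over an ordinary iterator.

```python
def route_by_index(samples, classifier, num_outputs):
    """Send each sample to one of num_outputs lists, chosen by classifier(index)."""
    outputs = [[] for _ in range(num_outputs)]
    for index, sample in enumerate(samples):
        outputs[classifier(index)].append(sample)
    return outputs


def every_fifth_to_valid(index):
    """Return 0 (train) or 1 (valid); every fifth item goes to valid, roughly 80/20."""
    return 1 if index % 5 == 4 else 0


# Made-up (label, text) pairs standing in for a raw text dataset iterator.
samples = ((i % 2, f"review {i}") for i in range(10))
train_items, valid_items = route_by_index(samples, every_fifth_to_valid, 2)
```

Unlike a contiguous split, interleaving by index draws validation items from across the whole stream. This sketch collects both splits into lists, so it assumes the dataset fits in memory.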


Top Results From Across the Web

How to use Pytorch Dataloaders to work with enormously ...
Hence we can define the IterableDataset class for this problem as: Here we first create 2 separate iterators for both the files, then...

torchtext.data - Read the Docs
Datasets for train, validation, and test splits in that order, if the splits are provided. ... Create Dataset objects for multiple splits of...

How do I split an iterable dataset into training and test datasets?
Technically you can just set a goal ratio, and start collecting items into two lists randomly using that ratio. The result won't be...

Text classification with the torchtext library - PyTorch Tutorials
Those are the basic data processing building blocks for raw text string. ... It also works with an iterable dataset with the shuffle...

Text classification with the torchtext library - PyTorch
Build data processing pipeline to convert the raw text strings into torch. ... torchtext.datasets import AG_NEWS train_iter = iter(AG_NEWS(split='train')).
