
How to split `_RawTextIterableDataset`


❓ Questions and Help

I am trying to move away from the legacy API and use the newly provided features. I was doing this:

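import random
import torch
from torchtext import legacy
random.seed(SEED)  # SEED is defined elsewhere in the original script
TEXT = legacy.data.Field(lower=True, batch_first=True)
LABEL = legacy.data.LabelField(dtype=torch.float)
train_data, test_data = legacy.datasets.IMDB.splits(TEXT, LABEL, root='/tmp/imdb/')
# random.seed() returns None, so pass the seeded generator state instead
train_data, valid_data = train_data.split(split_ratio=0.8, random_state=random.getstate())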

But now I want to split the training data with the new API. How can I do that?

from torchtext.datasets import IMDB
train_iter, test_iter = IMDB(split=('train', 'test'))
# I need to split train_iter into train_iter and valid_iter

I also think that providing more features beyond just this one would help. Thanks!

Issue Analytics

Created 2 years ago
Comments: 7 (6 by maintainers)

Top GitHub Comments

2 reactions

zhangguanheng66 commented, Mar 30, 2021

It’s an iterator, so I don’t think you can split or shuffle it. I think it’s worth adding an option to set the offset, i.e. the line at which iteration begins. That way, for the valid set, you can start from a different line. cc @cpuhrsch @parmeet
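Until such an option exists, the offset idea can be approximated from the outside with `itertools.islice`. The sketch below is only an illustration under stated assumptions: `make_raw_iter` is a hypothetical stand-in for creating a fresh raw iterator (for example by calling `IMDB(split='train')` again), and `num_train` is an arbitrary cut-off chosen for the example.

```python
from itertools import islice


def make_raw_iter():
    """Hypothetical stand-in for a fresh raw dataset iterator.

    Yields (label, text) pairs, which is the general shape of the samples
    discussed in this issue. The contents are made up for illustration.
    """
    for i in range(10):
        yield (i % 2, f"review {i}")


num_train = 8  # assumed cut-off: the first 8 samples go to training

# Each split gets its own fresh iterator. The validation iterator simply
# starts from a different line (offset num_train).
train_iter = islice(make_raw_iter(), 0, num_train)
valid_iter = islice(make_raw_iter(), num_train, None)

train_samples = list(train_iter)
valid_samples = list(valid_iter)
```

This gives a contiguous, unshuffled split, so if the underlying data happens to be ordered by label, the validation tail may contain only one class.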

1 reaction

ejguan commented, Jun 24, 2022

It depends on how you want to split. For a simple case, you can use `demux` to split based on the indices generated by enumerating the prior DataPipe.
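A minimal sketch of that suggestion, assuming the DataPipe classes that ship with core PyTorch (`torch.utils.data.datapipes`). The sample data, the helper names `route_by_index` and `drop_index`, and the "every fifth sample goes to validation" rule are all made up for illustration; the indices come from Python's built-in `enumerate` here rather than from a DataPipe method.

```python
from torch.utils.data.datapipes.iter import IterableWrapper

# Hypothetical (label, text) samples standing in for the real dataset.
samples = [("pos" if i % 2 else "neg", f"review {i}") for i in range(10)]


def route_by_index(indexed_sample):
    """Send every fifth sample to output 1 (validation), the rest to output 0."""
    index, _ = indexed_sample
    return 1 if index % 5 == 0 else 0


def drop_index(indexed_sample):
    """Strip the index that was only needed for routing."""
    return indexed_sample[1]


# Attach an index to each sample, then let demux route on that index.
indexed_dp = IterableWrapper(list(enumerate(samples)))
train_dp, valid_dp = indexed_dp.demux(num_instances=2, classifier_fn=route_by_index)
train_dp = train_dp.map(drop_index)
valid_dp = valid_dp.map(drop_index)

train_samples = list(train_dp)
valid_samples = list(valid_dp)
```

Note that `demux` buffers items destined for the output that is not currently being read (its `buffer_size` argument defaults to 1000), so fully consuming one split before the other on a large dataset may require a larger buffer.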
