How to split `_RawTextIterableDataset`
❓ Questions and Help
I am trying to move away from the legacy API and use the newly provided features. This is what I was doing:
import random
import torch
from torchtext import legacy
random.seed(SEED)  # SEED is defined elsewhere
TEXT = legacy.data.Field(lower=True, batch_first=True)
LABEL = legacy.data.LabelField(dtype=torch.float)
train_data, test_data = legacy.datasets.IMDB.splits(TEXT, LABEL, root='/tmp/imdb/')
# random.seed() returns None, so pass the seeded RNG state instead
train_data, valid_data = train_data.split(split_ratio=0.8, random_state=random.getstate())
But now I want to split train_iter with the new API. How can I do that?
from torchtext.datasets import IMDB
train_iter, test_iter = IMDB(split=('train', 'test'))
# I need to split train_iter into train_iter and valid_iter
I also think that providing more features than just this one would help. Thanks!
Issue Analytics
- State:
- Created 2 years ago
- Comments: 7 (6 by maintainers)
Top GitHub Comments
It’s an iterator, so I don’t think you can split or shuffle it. I think it would be worth adding an option to set the offset, i.e. the line to start from, so that the valid set can start from a different line. cc @cpuhrsch @parmeet
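Until such an option exists, the offset idea can be emulated in plain Python. The following is only a minimal sketch: `make_iter` is a hypothetical stand-in for a call such as `IMDB(split='train')` that returns a fresh iterator each time, and `split_by_offset` is an illustrative helper, not a torchtext function.

```python
from itertools import islice

def make_iter():
    """Hypothetical stand-in for IMDB(split='train'): a fresh iterator per call."""
    return iter([(i % 2, f"review {i}") for i in range(10)])

def split_by_offset(iter_factory, num_train):
    """Illustrative helper: the first num_train items form the training stream,
    and a second iterator skips that many items to form the validation stream."""
    train_iter = islice(iter_factory(), num_train)
    valid_iter = islice(iter_factory(), num_train, None)
    return train_iter, valid_iter

train_iter, valid_iter = split_by_offset(make_iter, num_train=8)
train_samples = list(train_iter)
valid_samples = list(valid_iter)
```

Note that this takes a contiguous tail as the validation set. If the underlying data happens to be ordered by label, that tail will be skewed, so an interleaved split may be preferable.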
It depends on how you want to split. For a simple case, you can use `demux` to split based on the indices generated by enumerating from the prior DataPipe.
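A rough sketch of that suggestion, under several assumptions: the sample list is a hypothetical stand-in for the `(label, text)` pairs the dataset yields, `route` and `drop_index` are illustrative helper names, and Python's built-in `enumerate` is used on an in-memory list instead of an enumerating DataPipe step. Every fifth sample is routed to validation, which gives an 80/20 split.

```python
from torch.utils.data.datapipes.iter import IterableWrapper

# Hypothetical stand-in for the (label, text) pairs yielded by the train split.
samples = [("pos" if i % 2 else "neg", f"review {i}") for i in range(10)]

def route(indexed_sample):
    """Send every fifth sample to child 1 (validation), the rest to child 0 (training)."""
    index, _ = indexed_sample
    return 1 if index % 5 == 4 else 0

def drop_index(indexed_sample):
    """Strip the index that was only needed for routing."""
    return indexed_sample[1]

indexed_dp = IterableWrapper(list(enumerate(samples)))
# buffer_size=-1 removes the buffer limit: while one child is being read,
# items destined for the other child are held in memory.
train_dp, valid_dp = indexed_dp.demux(
    num_instances=2, classifier_fn=route, buffer_size=-1
)
train_dp = train_dp.map(drop_index)
valid_dp = valid_dp.map(drop_index)

train_samples = list(train_dp)
valid_samples = list(valid_dp)
```

With the default buffer size, reading one child to the end before touching the other can overflow the buffer on a large dataset, so either raise the limit as above or consume both children in an interleaved fashion.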