[RFC] Dataset builders: to have or not to have?
🚀 Feature
The current design in torchtext presents the user with two APIs for dataset construction:
- the "raw" API, which returns the raw text data from the dataset, and
- the one-liner builder API that applies tokenization + vocab mapping + returns train + val + test datasets.
While I understand that building the vocabulary might be annoying, I think that it is important to have one recommended way of doing things in torchtext. The one-liner API solves a few problems well, but for maximum flexibility the user might need the raw API. But if the user is only used to the one-liner API, switching to the raw API might be non-trivial.
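To make the contrast concrete, the two styles look roughly like this (illustrative pseudo-code only; the names and signatures are simplified and are not the exact torchtext API):

```python
# "one-liner" builder API: download + tokenize + build vocab + split, all internally
train_dataset, test_dataset = AG_NEWS(ngrams=2)

# "raw" API: the dataset only yields raw text; the user assembles the pipeline
raw_train = AG_NEWS_raw(split='train')  # yields (label, raw_text) pairs
```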
I propose that we instead favor the raw API, and illustrate with examples and tutorials how the vocabulary, etc., should be built.
Here are two examples. I'm using map-style datasets for simplicity; deciding between map-style and iterable-style datasets is a topic for a different discussion.
Example 1: Text classification
This is one example of what I would propose for a text classification dataset
```python
class AGNews:
    def __init__(self, ..., src_transform=None, tgt_transform=None):
        self.src_transform = src_transform
        self.tgt_transform = tgt_transform

    def __getitem__(self, idx):
        ...
        if self.src_transform is not None:
            src = self.src_transform(src)
        if self.tgt_transform is not None:
            label = self.tgt_transform(label)
        return src, label
```
Then, the user would use the dataset in the following way
```python
tokenizer = get_default_tokenizer(lang="en")
raw_dataset = AGNews(..., src_transform=tokenizer)
# or the equivalent API
vocab = build_vocab(raw_dataset)
# user can cache the vocab if they want,
# or combine multiple datasets via ConcatDataset
# before creating the vocab, etc.
...

# now create the datasets used for training
dataset_train = AGNews(..., split='train', src_transform=Compose([tokenizer, vocab]))
dataset_test = AGNews(..., split='test', src_transform=Compose([tokenizer, vocab]))
```
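For reference, the `build_vocab` and `Compose` helpers used above could be as simple as the following sketch (hypothetical code, shown only to make the example self-contained; the real torchtext utilities may look different):

```python
from collections import Counter

class Compose:
    """Chain transforms left to right, like torchvision's Compose."""
    def __init__(self, transforms):
        self.transforms = transforms

    def __call__(self, x):
        for t in self.transforms:
            x = t(x)
        return x


def build_vocab(dataset, min_freq=1, unk_token="<unk>"):
    """Count tokens over a raw (already tokenized) map-style dataset
    and return a callable token-list -> index-list mapping."""
    counter = Counter()
    for idx in range(len(dataset)):
        tokens, _label = dataset[idx]
        counter.update(tokens)
    itos = [unk_token] + [tok for tok, freq in counter.most_common() if freq >= min_freq]
    stoi = {tok: i for i, tok in enumerate(itos)}
    # returning a callable lets it compose directly with the tokenizer
    return lambda tokens: [stoi.get(tok, 0) for tok in tokens]
```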
The current proposal adds two extra lines of overhead for the user, but it teaches them how to use torchtext for whatever they need.
Example 2: Translation
Here is an example for a translation dataset
```python
class EnFrTranslation:
    # __init__ stores src_transform / tgt_transform, as in AGNews above

    def __getitem__(self, idx):
        ...
        if self.src_transform is not None:
            src = self.src_transform(src)
        if self.tgt_transform is not None:
            tgt = self.tgt_transform(tgt)
        return src, tgt
```
The user would then do the following in their code:
```python
tok_en = get_default_tokenizer(lang="en")
tok_fr = get_default_tokenizer(lang="fr")

# the source data for creating the vocabulary
# can be the same dataset or a completely different one,
# but it's explicit to the user how they can obtain
# different vocabs, for example from unsupervised
# datasets where we don't have pairings
raw_dataset = EnFrTranslation(..., src_transform=tok_en, tgt_transform=tok_fr)

# build the vocabulary for each language
# independently
vocab_en, vocab_fr = Vocab(), Vocab()
for idx in range(len(raw_dataset)):
    src, tgt = raw_dataset[idx]
    vocab_en.add(src)
    vocab_fr.add(tgt)
vocab_en.finalize()
vocab_fr.finalize()

# now create the datasets used for training
# the model
dataset_train = EnFrTranslation(..., split='train',
    src_transform=Compose([tok_en, vocab_en]),
    tgt_transform=Compose([tok_fr, vocab_fr])
)
dataset_test = EnFrTranslation(..., split='test',
    src_transform=Compose([tok_en, vocab_en]),
    tgt_transform=Compose([tok_fr, vocab_fr])
)
```
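For completeness, a minimal `Vocab` with the `add`/`finalize` interface used above might look like the following (purely illustrative; this is not the existing `torchtext.vocab.Vocab`):

```python
from collections import Counter

class Vocab:
    """Token -> index mapping that is 'trained' via add() and frozen via finalize()."""
    def __init__(self, specials=("<unk>", "<pad>")):
        self.counter = Counter()
        self.specials = list(specials)
        self.stoi = None

    def add(self, tokens):
        # accumulate token counts from one (already tokenized) example
        self.counter.update(tokens)

    def finalize(self, min_freq=1):
        # freeze the mapping; tokens below min_freq fall back to <unk>
        itos = self.specials + [t for t, c in self.counter.most_common() if c >= min_freq]
        self.stoi = {tok: i for i, tok in enumerate(itos)}

    def __call__(self, tokens):
        # usable as a transform inside Compose once finalized
        unk = self.stoi["<unk>"]
        return [self.stoi.get(tok, unk) for tok in tokens]
```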
Verbose, but explicit
While the proposed way of laying out datasets makes it a bit more verbose for users to get started, the intent is clear from the beginning. A vocabulary is nothing more than a data transformation (akin to the transforms we use in torchvision), with the subtlety that it needs to be "trained", and how we "train" it is independent of the dataset.
One benefit of making this explicit is that the user has fewer opportunities to shoot themselves in the foot. As an example, re-using the same vocab while changing the tokenizer is a silent error with the "one-liner" API: there is nothing we can do to prevent users from mixing up different tokenizers and vocabs, and one could have expected it to just magically work.
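For example (hypothetical code reusing the sketches above; `word_tokenizer` and `subword_tokenizer` are placeholders), nothing errors out, the model just silently sees mostly out-of-vocabulary indices:

```python
# vocab built over word-level tokens
vocab = build_vocab(AGNews(..., src_transform=word_tokenizer))

# later, the tokenizer is swapped but the old vocab is kept
dataset = AGNews(..., split='train',
                 src_transform=Compose([subword_tokenizer, vocab]))
# no exception is raised; most subword tokens are simply absent from the
# vocab and get mapped to <unk>
```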
Making it explicit puts the burden of caching the vocab and performing the transformations on the user themselves, and not on the library maintainer.
Towards Vocab within the model?
The above proposal also makes it clear that the vocabulary (and tokenizer) could be part of the model instead of the dataset. Indeed, a model is tightly coupled with a vocab (and the tokenizer as well), so once we have an efficient pmap in PyTorch we could just move them to the model. But that's a topic for a separate discussion.
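As a rough sketch of what that direction could look like (illustrative only, and glossing over batching and the fact that the text ops would need to be scriptable/vectorizable to be efficient):

```python
import torch
from torch import nn

class TextClassifier(nn.Module):
    """Sketch of a model that owns its tokenizer and vocab."""
    def __init__(self, tokenizer, vocab, vocab_size, embed_dim=64, num_classes=4):
        super().__init__()
        self.tokenizer = tokenizer      # preprocessing travels with the model,
        self.vocab = vocab              # so serving code only ever sees raw strings
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim)
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, raw_text):
        ids = torch.tensor(self.vocab(self.tokenizer(raw_text)), dtype=torch.long)
        return self.fc(self.embedding(ids.unsqueeze(0)))
```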
Top GitHub Comments
I think removing the builders is fine; as more datasets get added, more and more edge cases will have to be written into `setup_datasets`. It also means less code for torchtext to maintain. I'd still keep the `TextClassificationDataset`, `LanguageModelingDataset`, etc. classes though, and then for each of them have a short code example similar to the `setup_datasets` functionality there is now.

As part of the discussion, I put together a short review of the heterogeneity of the text datasets:
- `AG_NEWS` dataset (link) is a CSV file in which labels and texts are separated by commas.
- `IMDB` dataset (link) contains thousands of individual files, and each file has a single sentence. Positive and negative reviews are grouped in separate folders.
- `WikiText2` dataset (link) is a text file.
- `BookCorpus` dataset (see FAIR cluster `/datasets01/bookcorpus/021819/`) has multiple files, and each file represents a book.