[RFC] Dataset builders: to have or not to have?
🚀 Feature
The current design in torchtext presents the user with two APIs for dataset construction:
- the "raw" API, which returns the raw text data from the dataset, and
- the one-liner builder API that applies tokenization + vocab mapping + returns train + val + test datasets.
While I understand that building the vocabulary might be annoying, I think that it is important to have one recommended way of doing things in torchtext. The one-liner API solves a few problems well, but for maximum flexibility the user might need the raw API. But if the user is only used to the one-liner API, switching to the raw API might be non-trivial.
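To make the contrast concrete, the two styles look roughly like this (illustrative pseudo-code only; the names and signatures are simplified and are not the exact torchtext API):

```python
# "one-liner" builder API: download + tokenize + build vocab + split, all internally
train_dataset, test_dataset = AG_NEWS(ngrams=2)

# "raw" API: the dataset only yields raw text; the user assembles the pipeline
raw_train = AG_NEWS_raw(split='train')  # yields (label, raw_text) pairs
```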
I propose that we instead favor the raw API, and illustrate with examples and tutorials how the vocabulary, etc., should be built.
Here are two examples. I'm using map-style datasets for simplicity; deciding between map-style and iterable-style datasets is a topic for a different discussion.
Example 1: Text classification
This is one example of what I would propose for a text classification dataset
```python
class AGNews:
    def __init__(self, ..., src_transform=None, tgt_transform=None):
        self.src_transform = src_transform
        self.tgt_transform = tgt_transform

    def __getitem__(self, idx):
        ...
        if self.src_transform is not None:
            src = self.src_transform(src)
        if self.tgt_transform is not None:
            label = self.tgt_transform(label)
        return src, label
```
Then, the user would use the dataset in the following way
```python
tokenizer = get_default_tokenizer(lang="en")
raw_dataset = AGNews(..., src_transform=tokenizer)
# or the equivalent API
vocab = build_vocab(raw_dataset)
# user can cache the vocab if they want,
# or combine multiple datasets via ConcatDataset
# before creating the vocab, etc.
...

# now create the datasets used for training
dataset_train = AGNews(..., split='train', src_transform=Compose([tokenizer, vocab]))
dataset_test = AGNews(..., split='test', src_transform=Compose([tokenizer, vocab]))
```
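For reference, the `build_vocab` and `Compose` helpers used above could be as simple as the following sketch (hypothetical code, shown only to make the example self-contained; the real torchtext utilities may look different):

```python
from collections import Counter

class Compose:
    """Chain transforms left to right, like torchvision's Compose."""
    def __init__(self, transforms):
        self.transforms = transforms

    def __call__(self, x):
        for t in self.transforms:
            x = t(x)
        return x


def build_vocab(dataset, min_freq=1, unk_token="<unk>"):
    """Count tokens over a raw (already tokenized) map-style dataset
    and return a callable token-list -> index-list mapping."""
    counter = Counter()
    for idx in range(len(dataset)):
        tokens, _label = dataset[idx]
        counter.update(tokens)
    itos = [unk_token] + [tok for tok, freq in counter.most_common() if freq >= min_freq]
    stoi = {tok: i for i, tok in enumerate(itos)}
    # returning a callable lets it compose directly with the tokenizer
    return lambda tokens: [stoi.get(tok, 0) for tok in tokens]
```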
The current proposal adds two extra lines of overhead for the user, but it teaches them how to use torchtext for whatever they need.
Example 2: Translation
Here is an example for a translation dataset
```python
class EnFrTranslation:
    # __init__ stores src_transform / tgt_transform, as in AGNews above

    def __getitem__(self, idx):
        ...
        if self.src_transform is not None:
            src = self.src_transform(src)
        if self.tgt_transform is not None:
            tgt = self.tgt_transform(tgt)
        return src, tgt
```
The user would then do the following in their code:
```python
tok_en = get_default_tokenizer(lang="en")
tok_fr = get_default_tokenizer(lang="fr")

# the source data for creating the vocabulary
# can be the same dataset or a completely different one,
# but it's explicit to the user how they can obtain
# different vocabs, for example from unsupervised
# datasets where we don't have pairings
raw_dataset = EnFrTranslation(..., src_transform=tok_en, tgt_transform=tok_fr)

# build the vocabulary for each language
# independently
vocab_en, vocab_fr = Vocab(), Vocab()
for idx in range(len(raw_dataset)):
    src, tgt = raw_dataset[idx]
    vocab_en.add(src)
    vocab_fr.add(tgt)
vocab_en.finalize()
vocab_fr.finalize()

# now create the datasets used for training
# the model
dataset_train = EnFrTranslation(..., split='train',
    src_transform=Compose([tok_en, vocab_en]),
    tgt_transform=Compose([tok_fr, vocab_fr])
)
dataset_test = EnFrTranslation(..., split='test',
    src_transform=Compose([tok_en, vocab_en]),
    tgt_transform=Compose([tok_fr, vocab_fr])
)
```
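For completeness, a minimal `Vocab` with the `add`/`finalize` interface used above might look like the following (purely illustrative; this is not the existing `torchtext.vocab.Vocab`):

```python
from collections import Counter

class Vocab:
    """Token -> index mapping that is 'trained' via add() and frozen via finalize()."""
    def __init__(self, specials=("<unk>", "<pad>")):
        self.counter = Counter()
        self.specials = list(specials)
        self.stoi = None

    def add(self, tokens):
        # accumulate token counts from one (already tokenized) example
        self.counter.update(tokens)

    def finalize(self, min_freq=1):
        # freeze the mapping; tokens below min_freq fall back to <unk>
        itos = self.specials + [t for t, c in self.counter.most_common() if c >= min_freq]
        self.stoi = {tok: i for i, tok in enumerate(itos)}

    def __call__(self, tokens):
        # usable as a transform inside Compose once finalized
        unk = self.stoi["<unk>"]
        return [self.stoi.get(tok, unk) for tok in tokens]
```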
Verbose, but explicit
While the proposed way of laying out datasets makes it a bit more verbose for users to get started, the intent is clear from the beginning. A vocabulary is nothing more than a data transformation (akin to the transforms we use in torchvision), with the subtlety that it needs to be "trained", and how we "train" it is independent of the dataset.
One benefit of making this explicit is that the user has fewer opportunities to shoot themselves in the foot. As an example, re-using the same vocab while changing the tokenizer is a silent error with the "one-liner" API: there is nothing we can do to prevent users from mixing up different tokenizers and vocabs, and one could have expected it to just magically work.
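For example (hypothetical code reusing the sketches above; `word_tokenizer` and `subword_tokenizer` are placeholders), nothing errors out, the model just silently sees mostly out-of-vocabulary indices:

```python
# vocab built over word-level tokens
vocab = build_vocab(AGNews(..., src_transform=word_tokenizer))

# later, the tokenizer is swapped but the old vocab is kept
dataset = AGNews(..., split='train',
                 src_transform=Compose([subword_tokenizer, vocab]))
# no exception is raised; most subword tokens are simply absent from the
# vocab and get mapped to <unk>
```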
Making it explicit puts the burden of caching the vocab and performing the transformations on the user themselves, and not on the library maintainer.
Towards Vocab within the model?
The above proposal also makes it clear that the vocabulary (and tokenizer) could be part of the model instead of the dataset. Indeed, a model is tightly coupled with a vocab (and the tokenizer as well), so once we have an efficient pmap in PyTorch we could just move them to the model. But that's a topic for a separate discussion.
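As a rough sketch of what that direction could look like (illustrative only, and glossing over batching and the fact that the text ops would need to be scriptable/vectorizable to be efficient):

```python
import torch
from torch import nn

class TextClassifier(nn.Module):
    """Sketch of a model that owns its tokenizer and vocab."""
    def __init__(self, tokenizer, vocab, vocab_size, embed_dim=64, num_classes=4):
        super().__init__()
        self.tokenizer = tokenizer      # preprocessing travels with the model,
        self.vocab = vocab              # so serving code only ever sees raw strings
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim)
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, raw_text):
        ids = torch.tensor(self.vocab(self.tokenizer(raw_text)), dtype=torch.long)
        return self.fc(self.embedding(ids.unsqueeze(0)))
```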
Top GitHub Comments
I think removing the builders is fine; as more datasets get added, more and more edge cases will have to be written into `setup_datasets`. It also means less code for torchtext to maintain. I'd still keep the `TextClassificationDataset`, `LanguageModelingDataset`, etc. classes though, and then for each of them have a short code example similar to the `setup_datasets` functionality there is now.

As part of the discussion, I put together a short review of the heterogeneity of the text datasets:
- `AG_NEWS` dataset (link) is a CSV file in which labels and texts are separated by commas.
- `IMDB` dataset (link) contains thousands of individual files, and each file has a single sentence. Positive and negative reviews are grouped in separate folders.
- `WikiText2` dataset (link) is a text file.
- `BookCorpus` dataset (see FAIR cluster `/datasets01/bookcorpus/021819/`) has multiple files, and each file represents a book.