
[RFC] Dataset builders: to have or not to have?


🚀 Feature

The current design in torchtext presents the user with two APIs for dataset construction:

  • the ā€œrawā€ API, which returns the raw text data from the dataset, and
  • the one-liner builder API that applies tokenization + vocab mapping + returns train + val + test datasets.
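
For concreteness, the two styles look roughly like this (module paths and signatures quoted from memory of the experimental namespace; treat them as approximate, not authoritative):

# One-liner builder API: tokenization + vocab + splits in a single call
# (approximate signature, assuming torchtext's experimental namespace)
from torchtext.experimental.datasets import AG_NEWS
train_dataset, test_dataset = AG_NEWS()

# "Raw" API: yields untokenized (label, text) pairs, and the user builds
# the tokenizer/vocab pipeline themselves
from torchtext.experimental.datasets.raw import AG_NEWS as AG_NEWS_raw
train_iter, test_iter = AG_NEWS_raw()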

While I understand that building the vocabulary might be annoying, I think that it is important to have one recommended way of doing things in torchtext. The one-liner API solves a few problems well, but for maximum flexibility the user might need the raw API. But if the user is only used to the one-liner API, switching to the raw API might be non-trivial.

I propose that we instead favor the raw API, and illustrate with examples and tutorials how the vocabulary, etc., should be built.

Here are two examples. I'm using map-style datasets for simplicity; deciding between map-style and iterable-style datasets is a topic for a different discussion.

Example 1: Text classification

This is one example of what I would propose for a text classification dataset

class AGNews:
    def __init__(self, ..., src_transform=None, tgt_transform=None):
        self.src_transform = src_transform
        self.tgt_transform = tgt_transform

    def __getitem__(self, idx):
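        # load the raw (src, label) example for this index; loading logic elided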
        ...
        if self.src_transform is not None:
            src = self.src_transform(src)
        if self.tgt_transform is not None:
            label = self.tgt_transform(label)

        return src, label

Then, the user would use the dataset in the following way

tokenizer = get_default_tokenizer(lang="en")
raw_dataset = AGNews(..., src_transform=tokenizer)
vocab = build_vocab(raw_dataset)  # or the equivalent API
# user can cache the vocab if they want
# or combine multiple datasets via ConcatDataset
# before creating the vocab, etc
...

# now create the datasets used for training
dataset_train = AGNews(..., split='train', src_transform=Compose([tokenizer, vocab]))
dataset_test = AGNews(..., split='test', src_transform=Compose([tokenizer, vocab]))

The current proposal adds two extra lines of overhead for the user, but it teaches them how to use torchtext to do whatever they need.

Example 2: Translation

Here is an example for a translation dataset

class EnFrTranslation:
    def __getitem__(self, idx):
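        # load the raw (src, tgt) sentence pair for this index; loading logic elided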
        ...
        if self.src_transform is not None:
            src = self.src_transform(src)
        if self.tgt_transform is not None:
            tgt = self.tgt_transform(tgt)
        return src, tgt

And the user would then do the following in their code

tok_en = get_default_tokenizer(lang="en")
tok_fr = get_default_tokenizer(lang="fr")

# source data for creating the vocabulary
# can be the same dataset or a completely different one
# but it's explicit to the user on how they can obtain
# different vocabs, for example from unsupervised
# datasets where we don't have pairings
raw_dataset = EnFrTranslation(..., src_transform=tok_en, tgt_transform=tok_fr)

# build the vocabulary for each language
# independently
vocab_en, vocab_fr = Vocab(), Vocab()
for idx in range(len(raw_dataset)):
    src, tgt = raw_dataset[idx]
    vocab_en.add(src)
    vocab_fr.add(tgt)
vocab_en.finalize()
vocab_fr.finalize()

# now create the datasets used for training
# the model
dataset_train = EnFrTranslation(..., split='train',
        src_transform=Compose([tok_en, vocab_en]),
        tgt_transform=Compose([tok_fr, vocab_fr])
)
dataset_test = EnFrTranslation(..., split='test',
        src_transform=Compose([tok_en, vocab_en]),
        tgt_transform=Compose([tok_fr, vocab_fr])
)

Verbose, but explicit

While the proposed way of laying out datasets makes it a bit more verbose for users to get started, the intent is clear from the beginning. A vocabulary is nothing more than a data transformation (akin to the transforms we use in torchvision), with the subtlety that it needs to be "trained", and how we "train" it is independent of the dataset.
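
To make this concrete, here is a minimal sketch of a vocab-as-transform. The add()/finalize() method names and the Compose transform follow the usage in Example 2; everything else is an assumption for illustration, not torchtext's actual Vocab API:

from collections import Counter

class Vocab:
    def __init__(self, unk_token="<unk>"):
        self.counter = Counter()
        self.stoi = None
        self.unk_token = unk_token

    def add(self, tokens):
        # "training" the vocab is just accumulating token counts
        self.counter.update(tokens)

    def finalize(self, min_freq=1):
        # freeze the token -> index mapping; index 0 is reserved for <unk>
        itos = [self.unk_token] + [
            tok for tok, count in self.counter.most_common() if count >= min_freq
        ]
        self.stoi = {tok: idx for idx, tok in enumerate(itos)}

    def __len__(self):
        return len(self.stoi)

    def __call__(self, tokens):
        # once trained, the vocab is an ordinary transform: tokens -> ids
        unk = self.stoi[self.unk_token]
        return [self.stoi.get(tok, unk) for tok in tokens]

class Compose:
    # apply transforms in sequence, as in torchvision
    def __init__(self, transforms):
        self.transforms = transforms

    def __call__(self, x):
        for transform in self.transforms:
            x = transform(x)
        return x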

One benefit of this being explicit is that the user has less opportunity to shoot themselves in the foot. As an example, re-using the same vocab while changing the tokenizer is a silent error with the "one-liner" API, as there is nothing we can do to prevent the user from mixing up different tokenizers and vocabs. One could have expected it to just magically work.
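
Here is a toy illustration of that failure mode, with hypothetical tokenizers and a plain-dict vocab standing in for the real classes: the vocab is built from whitespace tokens and then silently paired with a character tokenizer; no exception is raised, everything just maps to <unk>:

def whitespace_tok(text):
    return text.split()

def char_tok(text):
    return list(text)

UNK = 0
vocab = {tok: idx for idx, tok in enumerate(whitespace_tok("the cat sat"), start=1)}

print([vocab.get(t, UNK) for t in whitespace_tok("the cat sat")])  # [1, 2, 3]
print([vocab.get(t, UNK) for t in char_tok("the cat sat")])        # all UNK, no error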

Making it explicit puts the burden of deciding how to cache the vocab and how to perform the transformations on the user, and not on the library maintainer.
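
For example, caching the trained vocab can be a couple of lines in user code. This sketch reuses build_vocab and raw_dataset from Example 1; the file name is arbitrary, and torch.save/torch.load are standard PyTorch serialization APIs:

import os
import torch

VOCAB_PATH = "agnews_vocab.pt"

if os.path.exists(VOCAB_PATH):
    vocab = torch.load(VOCAB_PATH)
else:
    vocab = build_vocab(raw_dataset)
    torch.save(vocab, VOCAB_PATH)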

Towards Vocab within the model?

The above proposal also makes it clear that the vocabulary (and tokenizer) could be part of the model instead of the dataset. Indeed, a model is tightly coupled with a vocab (and the tokenizer as well), so once we have an efficient pmap in PyTorch we could just move them to the model. But that's a topic for a separate discussion.
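
As a rough sketch of that direction (all names hypothetical, reusing the toy Vocab/tokenizer interfaces sketched above), the text pipeline would live inside the module, so a checkpoint carries its own tokenizer and vocab, and forward() accepts raw strings:

import torch
import torch.nn as nn

class ClassifierWithVocab(nn.Module):
    def __init__(self, tokenizer, vocab, num_classes, embed_dim=64):
        super().__init__()
        self.tokenizer = tokenizer
        self.vocab = vocab
        self.embedding = nn.EmbeddingBag(len(vocab), embed_dim)
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, raw_texts):
        # tokenize + numericalize inside the model, not in the dataset
        ids = [torch.tensor(self.vocab(self.tokenizer(text))) for text in raw_texts]
        lengths = [len(x) for x in ids]
        offsets = torch.tensor([0] + lengths[:-1]).cumsum(dim=0)
        return self.fc(self.embedding(torch.cat(ids), offsets))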


Top GitHub Comments

bentrevett commented, Nov 2, 2020

@bentrevett Yup. I think the way you did with the raw_data_to_dataset func is what we want users to learn and handle (check out raw datasets, build the transform pipeline). And it's consistent with what @fmassa proposes here. FYI, pytorch provides a dataset split func: torch.utils.data.random_split.
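
A quick usage sketch for that split function (the 10% validation fraction is an arbitrary choice, and dataset_train is the map-style dataset from Example 1):

from torch.utils.data import random_split

num_val = len(dataset_train) // 10
dataset_train, dataset_valid = random_split(
    dataset_train, [len(dataset_train) - num_val, num_val]
)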

Additionally, we would like to hear more opinions on "dataset builders". For example, AG_NEWS in text classification or WikiText2 in language modeling. Do you think those "dataset builders" are still useful, or should we just provide the APIs for the raw datasets?

I think removing the builders is fine; as more datasets get added, there will have to be more and more edge cases written into setup_datasets. It also means less code for torchtext to maintain.

I'd still keep the TextClassificationDataset, LanguageModelingDataset, etc. classes though, and then for each of them have a short code example that would be similar to the setup_datasets functionality there is now.

zhangguanheng66 commented, Nov 2, 2020

As part of the discussion, I put together a short review of the heterogeneity of the text datasets:

  • text classification datasets have texts and labels. For example, the AG_NEWS dataset (link) is a csv file in which labels and texts are separated by commas (see the sketch after this list). The IMDB dataset (link) contains thousands of individual files, and each file has a single sentence. Positive and negative reviews are grouped into separate folders.
  • language modeling datasets usually have multiple files with text sentences. For example, WikiText2 dataset (link) is a text file. BookCorpus dataset (see FAIR cluster /datasets01/bookcorpus/021819/) has multiple files, and each file represents a book.
  • question answering datasets have context/question/answer/answer-position fields in a JSON file (link).
  • translation datasets have a pair of source and target sentences saved in two separate files (en vs fr).
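
For instance, a raw map-style wrapper over the AG_NEWS csv could be as small as the following sketch; the file path handling and column layout are assumptions based on the description above:

import csv

class AGNewsRaw:
    def __init__(self, csv_path):
        with open(csv_path, encoding="utf-8") as f:
            # assume the first column is the label and the rest is text
            self.rows = [(int(row[0]), " ".join(row[1:])) for row in csv.reader(f)]

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        label, text = self.rows[idx]
        return text, label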