Overview of issues in torchtext and the plan for revamping
Motivation and summary of the current issues in torchtext
Based on the feedback from users, there are several issues existing in torchtext, including
- Several components and functionals are unclear and difficult to adopt. For example, the `Field` class couples tokenizer, vocabulary, split, batching and sampling, padding, and numericalization together. The current `Field` class works as a "black box", and users are confused about what's going on within the class. Instead, those components should be divided into several basic building blocks. This is more consistent with the PyTorch core library, which grants users the freedom to build models and pipelines with orthogonal components.
- Incompatible with `DataLoader` and `Sampler` in `torch.utils.data`. The current datasets in torchtext are not compatible with the PyTorch core library. Some custom modules/functions in torchtext (e.g. `Iterator`, `Batch`, `splits`) should be replaced by the corresponding modules in `torch.utils.data`.
New datasets in torchtext.experimental.datasets
We have re-written several datasets in `torchtext.experimental.datasets` using the new abstractions. The old versions of the datasets are still available in `torchtext.datasets`, and the new datasets are opt-in.
- Sentiment analysis dataset (https://github.com/pytorch/text/pull/651)
  - IMDB
- Language modeling datasets (https://github.com/pytorch/text/pull/624), including
  - WikiText2
  - WikiText103
  - PennTreebank
Case study for IMDB dataset
API for new datasets
To load the new datasets, simply call the dataset API, as follows:
```python
from torchtext.experimental.datasets import IMDB
train_dataset, test_dataset = IMDB()
```
To specify a tokenizer:
```python
from torchtext.data.utils import get_tokenizer
tokenizer = get_tokenizer("spacy")
train_dataset, test_dataset = IMDB(tokenizer=tokenizer)
```
If you just need the test set (you must pass a `Vocab` object!):
```python
vocab = train_dataset.get_vocab()
test_dataset, = IMDB(tokenizer=tokenizer, vocab=vocab, data_select='test')
```
Legacy code
The old IMDB dataset is still available in `torchtext.datasets`. You can use the legacy datasets as follows:
```python
from torchtext import data, datasets

TEXT = data.Field(lower=True, include_lengths=True, batch_first=True)
LABEL = data.Field(sequential=False)
train, test = datasets.IMDB.splits(TEXT, LABEL)
```
Difference
With the old pattern, users have to create a `Field` object that includes a specific tokenizer. In the new dataset API, users can pass a custom tokenizer directly to the dataset constructor. A custom tokenizer defines the method to convert a string into a list of tokens.
```python
import torchtext
from torchtext.data.utils import get_tokenizer
from torchtext.experimental.datasets import IMDB

# Old pattern
TEXT = torchtext.data.Field(tokenize=get_tokenizer("basic_english"))

# New pattern
train_dataset, test_dataset = IMDB(tokenizer=get_tokenizer("spacy"))
```
In the old dataset, the `vocab` object is associated with the `Field` class, which is not flexible enough to accept a pre-trained `vocab` object. In the new dataset, the `vocab` object can be obtained by
```python
vocab = train_dataset.get_vocab()
new_vocab = torchtext.vocab.Vocab(counter=vocab.freqs, max_size=1000, min_freq=10)
```
and applied to generate other new datasets:
```python
from torchtext.experimental.datasets import WikiText2
train_dataset, test_dataset, valid_dataset = WikiText2(vocab=new_vocab)
```
The datasets with the new pattern return a tensor of token IDs, instead of the tokens returned by the old pattern. If users would like to retrieve the tokens from the IDs, simply use the following:
```python
train_vocab = train_dataset.get_vocab()
# each sample is a (label, text) tuple, so [0][1] is the text tensor of the first sample
tokens = [train_vocab.itos[token_id] for token_id in train_dataset[0][1]]
```
Unlike the old pattern using `BucketIterator.splits`, users are encouraged to use `torch.utils.data.DataLoader` to generate batches of data. You can specify how to batch and pad the samples with a custom function passed to `collate_fn`. Here is an example that pads sequences with similar lengths and loads the data through `DataLoader`. To generate random samples, turn on the `shuffle` flag in `DataLoader`; otherwise, a sequential sampler will be constructed automatically.
```python
import torch
from torch.utils.data import DataLoader

# Generate a list of tuples of (text length, index, label, text)
data_len = [(len(txt), idx, label, txt) for idx, (label, txt) in enumerate(train_dataset)]
data_len.sort()  # sort by length so that sequences with similar lengths end up in the same batch

# Look up the id of the padding token
pad_id = train_dataset.get_vocab()['<pad>']

def pad_data(data):
    # Find the max length within the mini-batch
    max_len = max(list(zip(*data))[0])
    label_list = list(zip(*data))[2]
    txt_list = list(zip(*data))[3]
    # Pad every sequence up to max_len with pad_id and stack into a single tensor
    padded_tensors = torch.stack(
        [torch.cat((txt, torch.tensor([pad_id] * (max_len - len(txt))).long()))
         for txt in txt_list])
    return padded_tensors, label_list

# Generate batches of 8 padded sequences
dataloader = DataLoader(data_len, batch_size=8, collate_fn=pad_data)
for idx, (txt, label) in enumerate(dataloader):
    print(idx, txt.size(), label)
```
To randomly split a dataset into non-overlapping new datasets of given lengths, use `torch.utils.data.random_split`:
```python
import torch
from torchtext.experimental.datasets import IMDB

train_dataset, test_dataset = IMDB()
train_subset, valid_subset = torch.utils.data.random_split(train_dataset, [15000, 10000])
```
Reference:
A few recent issues from OSS users:
- Sorting sentence within a batch is confusing #641
- `split` function is confusing #644
- Generate `vocab` object based on a subset of text file #642
- Pass a pre-trained `vocab` object to build a dataset #648
- Load unconstructed text data #649
- More flexibility to support word vector layers #650
- More compatible with `torch.utils.data.DataLoader` #660
Top GitHub Comments
Here's a minimal example of how to use your own data (here given as a very small list) to create a `TextClassificationDataset`:
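A minimal sketch along these lines, assuming the experimental API of that era exposes `TextClassificationDataset` in `torchtext.experimental.datasets.text_classification` and `sequential_transforms`, `vocab_func`, and `totensor` in `torchtext.experimental.functional`; treat the exact module paths and transform signatures as assumptions rather than the original example:

```python
import torch
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torchtext.experimental.datasets.text_classification import TextClassificationDataset
from torchtext.experimental.functional import sequential_transforms, vocab_func, totensor

# A very small list of (label, text) tuples standing in for real data
my_data = [('pos', 'this film is great'),
           ('neg', 'this film is terrible'),
           ('pos', 'the acting was wonderful')]

tokenizer = get_tokenizer('basic_english')

# Build the vocabulary from the tokenized text
text_vocab = build_vocab_from_iterator(tokenizer(text) for label, text in my_data)

# Text pipeline: tokenize -> numericalize with the vocab -> convert to a LongTensor
text_transform = sequential_transforms(tokenizer, vocab_func(text_vocab),
                                       totensor(dtype=torch.long))
# Label pipeline: map the string label to an integer -> convert to a LongTensor
label_transform = sequential_transforms(lambda x: 1 if x == 'pos' else 0,
                                        totensor(dtype=torch.long))

dataset = TextClassificationDataset(my_data, text_vocab,
                                    (label_transform, text_transform))
```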
The only missing step to apply this to actual data is to add some code that loads your data into the list of (label, text) tuples. Any pre-processing desired can be handled by writing your own custom tokenizer function or any other functions that fit within the `sequential_transforms`.
Here's some feedback after playing around with the new `experimental` API. First, I'd also like to say that the new API is great - makes for very clean code - the addition of `transforms` was a good idea and makes it a lot easier to use torchtext with other libraries, such as the huggingface transformers.

As for the feedback:
- Experimental vocab should take a `max_size` argument: an integer denoting the maximum size of the created vocabulary. It is more common to build a vocabulary up to a maximum size rather than to set a minimum frequency of tokens, although I still believe `min_freq` should remain. This should be an argument instead of the user cutting the `ordered_dict` to `max_size`, so it can be passed to functions such as `build_vocab_from_iterator`, `vocab_from_file`, etc. This means that some sort of sorting with respect to token frequency will have to be done internally in the Vocab C++ class(?). (A rough sketch of the manual workaround this would replace appears at the end of this comment.)
- Experimental vocab should take a `specials` argument: a list of strings, each representing a token that will always be in the vocabulary, regardless of how many times it appears in the `ordered_dict`. Each should be appended to the vocabulary after the `<unk>` and `<pad>` tokens but before the rest of the tokens. Used for adding `<sos>`, `<eos>`, `<mask>` tokens. Again, this should be an argument so they can be passed to `build_vocab_from_iterator`, etc.
- Experimental vocab's `unk_token` argument should be optional, and the vocab object should raise an error if the user tries to look up a token that isn't in the vocab when `unk_token` is not set. This is useful when building a vocabulary for labels, which is easy to do with the new vocab transform API.
- Experimental vocab's `pad_token` argument, from here, should also be optional. Again, for building a label vocabulary. I do believe the `pad_token` should be its own argument and not be in `specials` as it was in the legacy vocab.
- Experimental functional transforms should be imported with `experimental.transforms.functional` and not `experimental.functional`, i.e. there should be a `transforms` dir in `torchtext/experimental` with a `transforms.py` and a `functional.py` in it. This mirrors the way it is done in torchvision.
- Experimental vocab's arguments should be set as attributes of the vocab object. For example, I should be able to create a vocabulary and call `vocab.unk_token` to get the vocabulary's `unk_token`, and the same with `vocab.pad_token`, `vocab.min_freq`, `vocab.max_size`, `vocab.specials`, etc.
- Experimental vectors' `unk_tensor` argument should be either a tensor or a callable which returns a tensor. At the moment I can't initialize the vectors' OOV tokens from something like a uniform or Normal distribution without them all being the exact same tensor.
- Experimental raw text classification datasets, especially IMDB, are not actually "raw". If I'm getting the raw IMDB data then I want the labels to be "neg"/"pos" and not 0/1. This line is explicitly not making the data "raw" anymore. This would mean, for consistency, that the other text classification datasets should also have their "raw" labels, from here. However, as they are actually already stored with their labels as integers, maybe it's a bit weird to transform them back into strings. Not sure about this one.
- `vocab_from_raw_text_file`, `vocab_from_file` and `vocab` in `experimental.vocab` should be renamed `build_vocab_from_raw_text_file`, `build_vocab_from_file` and `build_vocab`: the first two for consistency with `build_vocab_from_iterator`, and the last one to avoid confusion with the `Vocab` class/object. It also explains what these functions do a bit better.

Happy to discuss all of these and help with any pull requests if needed.
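To illustrate the first two points, here is roughly the manual truncation and specials bookkeeping a user has to do today before calling the experimental `vocab` factory. This is only a sketch: the factory's exact signature is an assumption, the corpus and special tokens are placeholders, and the `max_size`/`specials` arguments shown in the final comment are the proposal, not an existing API.

```python
from collections import Counter, OrderedDict

from torchtext.data.utils import get_tokenizer
from torchtext.experimental.vocab import vocab  # the factory the last point proposes renaming to build_vocab

max_size = 1000                          # what the proposed max_size argument would control
specials = ['<sos>', '<eos>', '<mask>']  # what the proposed specials argument would control

tokenizer = get_tokenizer('basic_english')
texts = ['this film is great', 'this film is terrible']  # stand-in for a real corpus

# Count token frequencies and keep only the max_size most frequent tokens --
# the frequency sorting the proposal would move inside the Vocab class itself
counter = Counter(token for text in texts for token in tokenizer(text))

# Splice the specials in by hand ahead of the regular tokens; with a specials
# argument this bookkeeping would happen inside the vocab factory
ordered_dict = OrderedDict([(s, 1) for s in specials] + counter.most_common(max_size))

v = vocab(ordered_dict)
# proposed instead (not an existing API): vocab(ordered_dict, max_size=max_size, specials=specials)
```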