Overview of issues in torchtext and the plan for revamping
Motivation and summary of the current issues in torchtext
Based on the feedback from users, there are several issues existing in torchtext, including
- Several components and functionals are unclear and difficult to adopt. For example, the `Field` class couples tokenizer, vocabulary, split, batching and sampling, padding, and numericalization together. The current `Field` class works as a "black box", and users are confused about what's going on within the class. Instead, those components should be divided into several basic building blocks. This is more consistent with the PyTorch core library, which grants users the freedom to build models and pipelines with orthogonal components.
- Incompatible with `DataLoader` and `Sampler` in `torch.utils.data`. The current datasets in torchtext are not compatible with the PyTorch core library. Some custom modules/functions in torchtext (e.g. `Iterator`, `Batch`, `splits`) should be replaced by the corresponding modules in `torch.utils.data`.
New datasets in torchtext.experimental.datasets
We have re-written several datasets in `torchtext.experimental.datasets` using the new abstractions. The old versions of the datasets are still available in `torchtext.datasets`, and the new datasets are opt-in.
- Sentiment analysis dataset (https://github.com/pytorch/text/pull/651)
  - IMDB
- Language modeling datasets (https://github.com/pytorch/text/pull/624), including
  - WikiText2
  - WikiText103
  - PennTreebank
Case study for IMDB dataset
API for new datasets
To load the new datasets, simply call the dataset API, as follows:
```python
from torchtext.experimental.datasets import IMDB
train_dataset, test_dataset = IMDB()
```
To specify a tokenizer:
```python
from torchtext.data.utils import get_tokenizer
tokenizer = get_tokenizer("spacy")
train_dataset, test_dataset = IMDB(tokenizer=tokenizer)
```
If you just need the test set (you must pass a `Vocab` object!):
```python
vocab = train_dataset.get_vocab()
test_dataset, = IMDB(tokenizer=tokenizer, vocab=vocab, data_select='test')
```
Legacy code
The old IMDB dataset is still available in `torchtext.datasets`. You can use the legacy datasets as follows:
```python
from torchtext import data, datasets

TEXT = data.Field(lower=True, include_lengths=True, batch_first=True)
LABEL = data.Field(sequential=False)
train, test = datasets.IMDB.splits(TEXT, LABEL)
```
Difference
With the old pattern, users have to create a `Field` object that includes a specific tokenizer. In the new dataset API, users can pass a custom tokenizer directly to the dataset constructor. A custom tokenizer defines the method to convert a string into a list of tokens.
```python
import torchtext
from torchtext.data.utils import get_tokenizer
from torchtext.experimental.datasets import IMDB

# Old pattern
TEXT = torchtext.data.Field(tokenize=get_tokenizer("basic_english"))

# New pattern
train_dataset, test_dataset = IMDB(tokenizer=get_tokenizer("spacy"))
```
In the old dataset, the `vocab` object is associated with the `Field` class, which is not flexible enough to accept a pre-trained `vocab` object. In the new dataset, the `vocab` object can be obtained by
```python
vocab = train_dataset.get_vocab()
new_vocab = torchtext.vocab.Vocab(counter=vocab.freqs, max_size=1000, min_freq=10)
```
and applied to generate other new datasets:
```python
from torchtext.experimental.datasets import WikiText2
train_dataset, test_dataset, valid_dataset = WikiText2(vocab=new_vocab)
```
The datasets with the new pattern return a tensor of token IDs, instead of the tokens returned by the old pattern. If users would like to retrieve the tokens from the IDs, simply use the following:
```python
train_vocab = train_dataset.get_vocab()
# each sample is a (label, text) tuple, so [0][1] is the text tensor of the first sample
tokens = [train_vocab.itos[token_id] for token_id in train_dataset[0][1]]
```
Unlike the old pattern using `BucketIterator.splits`, users are encouraged to use `torch.utils.data.DataLoader` to generate batches of data. You can specify how to batch and pad the samples with a custom function passed to `collate_fn`. Here is an example that pads sequences with similar lengths and loads the data through `DataLoader`. To generate random samples, turn on the `shuffle` flag in `DataLoader`; otherwise, a sequential sampler will be constructed automatically.
```python
import torch
from torch.utils.data import DataLoader

# Generate a list of tuples of (text length, index, label, text)
data_len = [(len(txt), idx, label, txt) for idx, (label, txt) in enumerate(train_dataset)]
data_len.sort()  # sort by length so that sequences with similar lengths end up in the same batch

# Look up the id of the padding token
pad_id = train_dataset.get_vocab()['<pad>']

def pad_data(data):
    # Find the max length within the mini-batch
    max_len = max(list(zip(*data))[0])
    label_list = list(zip(*data))[2]
    txt_list = list(zip(*data))[3]
    # Pad every sequence up to max_len with pad_id and stack into a single tensor
    padded_tensors = torch.stack(
        [torch.cat((txt, torch.tensor([pad_id] * (max_len - len(txt))).long()))
         for txt in txt_list])
    return padded_tensors, label_list

# Generate batches of 8 padded sequences
dataloader = DataLoader(data_len, batch_size=8, collate_fn=pad_data)
for idx, (txt, label) in enumerate(dataloader):
    print(idx, txt.size(), label)
```
To randomly split a dataset into non-overlapping new datasets of given lengths, use `torch.utils.data.random_split`:
```python
import torch
from torchtext.experimental.datasets import IMDB

train_dataset, test_dataset = IMDB()
train_subset, valid_subset = torch.utils.data.random_split(train_dataset, [15000, 10000])
```
Reference:
A few recent issues from OSS users:
- Sorting sentence within a batch is confusing #641
- `split` function is confusing #644
- Generate `vocab` object based on a subset of text file #642
- Pass a pre-trained `vocab` object to build a dataset #648
- Load unconstructed text data #649
- More flexibility to support word vector layers #650
- More compatible with `torch.utils.data.DataLoader` #660
Top GitHub Comments
Here's a minimal example of how to use your own data (here given as a very small list) to create a `TextClassificationDataset`:
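A minimal sketch along these lines, assuming the experimental API of that era exposes `TextClassificationDataset` in `torchtext.experimental.datasets.text_classification` and `sequential_transforms`, `vocab_func`, and `totensor` in `torchtext.experimental.functional`; treat the exact module paths and transform signatures as assumptions rather than the original example:

```python
import torch
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torchtext.experimental.datasets.text_classification import TextClassificationDataset
from torchtext.experimental.functional import sequential_transforms, vocab_func, totensor

# A very small list of (label, text) tuples standing in for real data
my_data = [('pos', 'this film is great'),
           ('neg', 'this film is terrible'),
           ('pos', 'the acting was wonderful')]

tokenizer = get_tokenizer('basic_english')

# Build the vocabulary from the tokenized text
text_vocab = build_vocab_from_iterator(tokenizer(text) for label, text in my_data)

# Text pipeline: tokenize -> numericalize with the vocab -> convert to a LongTensor
text_transform = sequential_transforms(tokenizer, vocab_func(text_vocab),
                                       totensor(dtype=torch.long))
# Label pipeline: map the string label to an integer -> convert to a LongTensor
label_transform = sequential_transforms(lambda x: 1 if x == 'pos' else 0,
                                        totensor(dtype=torch.long))

dataset = TextClassificationDataset(my_data, text_vocab,
                                    (label_transform, text_transform))
```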
The only missing step to apply this to actual data is to add some code that loads your data into the list of (label, text) tuples. Any pre-processing desired can be handled by writing your own custom tokenizer function or any other functions that fit within the `sequential_transforms`.
Here's some feedback after playing around with the new `experimental` API. First, I'd also like to say that the new API is great - makes for very clean code - the addition of `transforms` was a good idea and makes it a lot easier to use torchtext with other libraries, such as the huggingface transformers.

As for the feedback:
- Experimental vocab should take a `max_size` argument: an integer denoting the maximum size of the created vocabulary. It is more common to build a vocabulary up to a maximum size rather than to set a minimum frequency of tokens, although I still believe `min_freq` should remain. This should be an argument instead of the user cutting the `ordered_dict` to `max_size`, so it can be passed to functions such as `build_vocab_from_iterator`, `vocab_from_file`, etc. This means that some sort of sorting with respect to token frequency will have to be done internally in the Vocab C++ class(?). (A rough sketch of the manual workaround this would replace appears at the end of this comment.)
- Experimental vocab should take a `specials` argument: a list of strings, each representing a token that will always be in the vocabulary, regardless of how many times it appears in the `ordered_dict`. Each should be appended to the vocabulary after the `<unk>` and `<pad>` tokens but before the rest of the tokens. Used for adding `<sos>`, `<eos>`, `<mask>` tokens. Again, this should be an argument so they can be passed to `build_vocab_from_iterator`, etc.
- Experimental vocab's `unk_token` argument should be optional, and the vocab object should raise an error if the user tries to look up a token that isn't in the vocab when `unk_token` is not set. This is useful when building a vocabulary for labels, which is easy to do with the new vocab transform API.
- Experimental vocab's `pad_token` argument, from here, should also be optional. Again, for building a label vocabulary. I do believe the `pad_token` should be its own argument and not be in `specials` as it was in the legacy vocab.
- Experimental functional transforms should be imported with `experimental.transforms.functional` and not `experimental.functional`, i.e. there should be a `transforms` dir in `torchtext/experimental` with a `transforms.py` and a `functional.py` in it. This mirrors the way it is done in torchvision.
- Experimental vocab's arguments should be set as attributes of the vocab object. For example, I should be able to create a vocabulary and call `vocab.unk_token` to get the vocabulary's `unk_token`, and the same with `vocab.pad_token`, `vocab.min_freq`, `vocab.max_size`, `vocab.specials`, etc.
- Experimental vectors' `unk_tensor` argument should be either a tensor or a callable which returns a tensor. At the moment I can't initialize the vectors' OOV tokens from something like a uniform or Normal distribution without them all being the exact same tensor.
- Experimental raw text classification datasets, especially IMDB, are not actually "raw". If I'm getting the raw IMDB data then I want the labels to be "neg"/"pos" and not 0/1. This line is explicitly not making the data "raw" anymore. This would mean, for consistency, that the other text classification datasets should also have their "raw" labels, from here. However, as they are actually already stored with their labels as integers, maybe it's a bit weird to transform them back into strings. Not sure about this one.
- `vocab_from_raw_text_file`, `vocab_from_file` and `vocab` in `experimental.vocab` should be renamed `build_vocab_from_raw_text_file`, `build_vocab_from_file` and `build_vocab`: the first two for consistency with `build_vocab_from_iterator`, and the last one to avoid confusion with the `Vocab` class/object. It also explains what these functions do a bit better.

Happy to discuss all of these and help with any pull requests if needed.
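To illustrate the first two points, here is roughly the manual truncation and specials bookkeeping a user has to do today before calling the experimental `vocab` factory. This is only a sketch: the factory's exact signature is an assumption, the corpus and special tokens are placeholders, and the `max_size`/`specials` arguments shown in the final comment are the proposal, not an existing API.

```python
from collections import Counter, OrderedDict

from torchtext.data.utils import get_tokenizer
from torchtext.experimental.vocab import vocab  # the factory the last point proposes renaming to build_vocab

max_size = 1000                          # what the proposed max_size argument would control
specials = ['<sos>', '<eos>', '<mask>']  # what the proposed specials argument would control

tokenizer = get_tokenizer('basic_english')
texts = ['this film is great', 'this film is terrible']  # stand-in for a real corpus

# Count token frequencies and keep only the max_size most frequent tokens --
# the frequency sorting the proposal would move inside the Vocab class itself
counter = Counter(token for text in texts for token in tokenizer(text))

# Splice the specials in by hand ahead of the regular tokens; with a specials
# argument this bookkeeping would happen inside the vocab factory
ordered_dict = OrderedDict([(s, 1) for s in specials] + counter.most_common(max_size))

v = vocab(ordered_dict)
# proposed instead (not an existing API): vocab(ordered_dict, max_size=max_size, specials=specials)
```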