
Adding validation splits to (experimental) text_classification datasets that do not have vocabulary built over them

See original GitHub issue

🚀 Feature

The experimental text_classification datasets should have a way to build a validation set from them, without the vocabulary being built over the validation set.

Motivation

In ML, you should always have training, validation and test sets. In NLP, your vocabulary should be built from the training set only, never from the validation or test sets.

The current experimental text classification (IMDB) dataset does not have a validation set and automatically builds the vocabulary whilst loading the train/test sets. After loading the train and test sets, we would need to construct a validation set with torch.utils.data.random_split. The issue here is that our vocabulary has already been built over the validation set we are about to create. There is currently no way to solve this issue.
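The leakage can be demonstrated with a toy counter-based vocabulary (a stand-in for torchtext's Vocab, not its actual API): because the vocabulary is built over all loaded training examples before we split, tokens that only occur in the soon-to-be validation portion still end up in it.

```python
from collections import Counter

# toy "training" data as loaded by the dataset: 10 tokenised reviews
train = [['good', 'film'], ['bad', 'film']] * 4 + [['rare_token', 'film']] * 2

# the dataset builds the vocabulary over ALL loaded training examples...
vocab = Counter(tok for example in train for tok in example)

# ...and only afterwards can we split off the last 20% as validation
split = int(len(train) * 0.8)
train_set, valid_set = train[:split], train[split:]

# 'rare_token' appears only in the validation split, yet it is in the vocab
print('rare_token' in vocab)                        # True -> leaked
print(any('rare_token' in ex for ex in train_set))  # False
```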

Pitch

'valid' should be accepted as a data_select argument and should split off a validation set before the vocabulary is built, so that the vocabulary only covers the remaining training examples. As the IMDB dataset does not have a standardized validation split, we can do something like taking the last 20% of the training set.

I am proposing something like the following after the iters_group is created here:

from itertools import islice, tee

if 'valid' in iters_group:
    # duplicate the training iterator, as it is consumed three times below
    train_iter_a, train_iter_b, train_iter_c = tee(iters_group['train'], 3)
    # the first 80% of examples stay as training data; the last 20% become validation
    len_train = int(sum(1 for _ in train_iter_a) * 0.8)
    iters_group['valid'] = islice(train_iter_b, len_train, None)
    iters_group['train'] = islice(train_iter_c, 0, len_train)
    iters_group['vocab'] = islice(iters_group['vocab'], 0, len_train)

tee duplicates an iterator and islice slices one. We need three copies of the training iterator because it is consumed three times. The first copy is exhausted to count the training examples, which tells us where the 80/20 boundary falls. islice then takes the last 20% of training examples as the validation set and the first 80% as the new training set. Finally, the "vocab" iterator is cut to the same first 80%, so the vocabulary is built from exactly the same examples as the new training set.
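The same tee/islice pattern can be checked on a plain generator, independent of torchtext:

```python
from itertools import islice, tee

examples = (i for i in range(10))  # stand-in for the training iterator

# three copies: one to count, one for the validation slice, one for the train slice
it_a, it_b, it_c = tee(examples, 3)
len_train = int(sum(1 for _ in it_a) * 0.8)  # exhaust the first copy to count

valid = list(islice(it_b, len_train, None))  # last 20%
train = list(islice(it_c, 0, len_train))     # first 80%

print(train)  # [0, 1, 2, 3, 4, 5, 6, 7]
print(valid)  # [8, 9]
```

One cost worth noting: fully consuming one tee'd copy to count forces tee to buffer every example for the other copies, so this effectively holds the whole training set in memory.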

We can now correctly load a train, valid and test set with vocabulary built only over the training set:

from torchtext.experimental import datasets

train_data, valid_data, test_data = datasets.IMDB(data_select=('train', 'valid', 'test'))

You can also load a custom vocabulary built from the original vocabulary like so (note that 'valid' needs to be in data_select when building the original vocabulary):

import os

from torchtext import vocab
from torchtext.experimental import datasets

def get_IMDB(root, tokenizer, vocab_max_size, vocab_min_freq):

    os.makedirs(root, exist_ok=True)

    # build the default vocabulary over the training portion only
    train_data, _ = datasets.IMDB(tokenizer=tokenizer,
                                  data_select=('train', 'valid'))

    old_vocab = train_data.get_vocab()

    # rebuild the vocabulary with a size cap and a minimum frequency
    new_vocab = vocab.Vocab(old_vocab.freqs,
                            max_size=vocab_max_size,
                            min_freq=vocab_min_freq)

    train_data, valid_data, test_data = datasets.IMDB(tokenizer=tokenizer,
                                                      vocab=new_vocab,
                                                      data_select=('train', 'valid', 'test'))

    return train_data, valid_data, test_data

Happy to make the PR if this is given the go-ahead.

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 8 (8 by maintainers)

Top GitHub Comments

3 reactions
bentrevett commented on Feb 5, 2020

One way of dealing with this would be to modify text classification to return the raw text instead of building a vocab if it doesn’t exist. That way you’d get a training and testing dataset that yields the lines of text (in UTF-8) format, which could then be fed into a vocab factory.

I would prefer this over my proposed solution.
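The raw-text approach above could be sketched with a plain-Python vocab factory. This is a hypothetical illustration: the function name, whitespace tokenisation and dict-based vocab are assumptions, not the torchtext API.

```python
from collections import Counter

def build_vocab(raw_train_iter, max_size=None, min_freq=1):
    """Hypothetical vocab factory: counts tokens from raw training lines only."""
    counter = Counter()
    for line in raw_train_iter:
        counter.update(line.split())  # whitespace tokenisation, assumed
    # keep the most frequent tokens that meet the frequency threshold
    tokens = [tok for tok, freq in counter.most_common(max_size) if freq >= min_freq]
    return {tok: idx for idx, tok in enumerate(tokens)}

# raw UTF-8 lines, as the dataset would yield them under this proposal;
# the validation/test lines are simply never passed to the factory
raw_train = ['a good film', 'a bad film']
vocab = build_vocab(raw_train, min_freq=2)
print(sorted(vocab))  # ['a', 'film']
```

Since the dataset yields raw lines, the train/valid split can happen first and only the training lines are ever fed to the factory, which removes the leakage by construction.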

0 reactions
zhangguanheng66 commented on Apr 21, 2020

fixed in #701

