Revamp TorchText Dataset Testing Strategy

🚀 Feature

Revamp our dataset testing strategy to reduce the amount of time spent waiting on tests to complete before merging PRs in torchtext.

Motivation

TorchText dataset testing currently relies on downloading and caching the datasets daily and then running CircleCI tests on the cached data. This can be slow and unreliable for the first PR that kicks off the dataset download and caching. In addition, dataset extraction can be time-consuming for some of the larger datasets within torchtext, and this extraction occurs each time the dataset tests are run on a PR. For these reasons, tests on CircleCI can take up to an hour to run for each PR, whereas vision/audio tests run in mere minutes. We want to revamp our dataset testing strategy to reduce the amount of time we spend waiting on tests to complete before merging our PRs in torchtext.

Pitch

We need to update the legacy dataset tests within torchtext. Currently we test for things including:

URL link
MD5 hash of the entire dataset
dataset name
number of lines in dataset

Going forward, it doesn’t make sense to test the MD5 hash or the number of lines in the dataset. Instead, we will:

Use mocking to test the implementation of our dataset
Use smoke tests for URLs and integrity of data (potentially with Github Actions)
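
As a rough illustration of the mocking idea, the sketch below writes a tiny file with the same layout as a dataset, reads it back through the dataset function, and checks that no download was attempted. Everything in it is a stand-in invented for this example (`toy_reviews`, `download`, `write_mock_data`, the URL, and the CSV layout); it is not torchtext's actual API or test code.

```python
import csv
import os
import tempfile
from unittest import mock

# Hypothetical stand-ins for a dataset module; none of these names come from torchtext.
URL = "https://example.com/toy_reviews.csv"
FILE_NAME = "toy_reviews.csv"


def download(url, path):
    """Placeholder for a network download; unit tests should never reach it."""
    raise RuntimeError("network access is not expected in unit tests")


def toy_reviews(root, download_fn=download):
    """Yield (label, text) pairs, downloading the CSV only if it is not cached."""
    path = os.path.join(root, FILE_NAME)
    if not os.path.exists(path):
        download_fn(URL, path)
    with open(path, encoding="utf-8", newline="") as f:
        for label, text in csv.reader(f):
            yield int(label), text


def write_mock_data(root, num_lines=5):
    """Write a small file shaped like the real data and return the expected samples."""
    samples = [(i % 2 + 1, f"review number {i}") for i in range(num_lines)]
    with open(os.path.join(root, FILE_NAME), "w", encoding="utf-8", newline="") as f:
        csv.writer(f).writerows(samples)
    return samples


def run_mocked_dataset_test():
    """Build mock data in a temp dir, load it, and confirm no download happened."""
    with tempfile.TemporaryDirectory() as root:
        expected = write_mock_data(root)
        fake_download = mock.Mock()
        actual = list(toy_reviews(root, download_fn=fake_download))
        # The mocked file is already in place, so the downloader must stay untouched.
        fake_download.assert_not_called()
    return actual, expected
```

Because the test only exercises parsing logic against a few generated lines, it needs neither the network nor the full dataset; URL availability and data integrity would be left to the separate smoke tests.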

Backlog of Dataset Tests

AG_NEWS #1553
AmazonReviewFull #1561
AmazonReviewPolarity #1532
DBpedia #1566
SogouNews #1576
YelpReviewFull #1567
YelpReviewPolarity #1567
YahooAnswers #1577
CoNLL2000Chunking #1570
UDPOS #1569
IWSLT2016 #1563, #1596
IWSLT2017 #1598
Multi30K #1554
SQuAD1 #1574
SQuAD2 #1575
PennTreebank #1578
WikiText103 #1592
WikiText2 #1592
EnWik9 #1560
IMDB #1579
SST2 #1542
CC-100 #1583

Contributing

We have already implemented a dataset test for AmazonReviewPolarity (#1532) as an example to follow when implementing future dataset tests. Please leave a message below if you plan to work on a particular dataset test to avoid duplication of effort. Also, please link to the corresponding PRs.

Follow-Up Items

Encode all strings as utf8 before writing to file when creating mocked data (see https://github.com/pytorch/text/pull/1554#discussion_r795182949)
- #1599
- #1642
Parameterize tests for similar datasets (see https://github.com/pytorch/text/pull/1575#issuecomment-1031901371) #1600
Fix formatting for all dataset tests #1601
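
The first two follow-up items could look roughly like the sketch below: mocked lines are encoded as UTF-8 explicitly before being written, and one test body is reused across datasets that share a file format. The file names, the tab-separated layout, and the helper names are assumptions made for this example, and the standard library's `subTest` is used here only to keep the sketch self-contained; the linked PRs may use a different parameterization mechanism.

```python
import os
import tempfile
import unittest


def write_mock_file(path, samples):
    """Encode each line as UTF-8 and write bytes, independent of the platform default encoding."""
    with open(path, "wb") as f:
        for label, text in samples:
            f.write(f"{label}\t{text}\n".encode("utf-8"))


def read_mock_file(path):
    """Read back '<label>\\t<text>' lines as (int, str) pairs."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            label, text = line.rstrip("\n").split("\t", 1)
            yield int(label), text


class TestSimilarDatasets(unittest.TestCase):
    # Made-up names standing in for datasets that share one file format.
    DATASET_FILES = ["toy_full.tsv", "toy_polarity.tsv"]
    SPLITS = ["train", "test"]

    def test_round_trip(self):
        samples = [(1, "café"), (2, "naïve résumé"), (1, "日本語")]
        with tempfile.TemporaryDirectory() as root:
            for name in self.DATASET_FILES:
                for split in self.SPLITS:
                    # Each (dataset, split) pair is reported as its own sub-test.
                    with self.subTest(dataset=name, split=split):
                        path = os.path.join(root, f"{split}_{name}")
                        write_mock_file(path, samples)
                        self.assertEqual(list(read_mock_file(path)), samples)


def run_similar_dataset_tests():
    """Run the test case programmatically and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestSimilarDatasets)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Writing bytes rather than relying on the default text encoding keeps the mocked files identical across operating systems, which matters once samples contain non-ASCII text.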

Additional Context

We explored how other domains implement testing for datasets and summarize their approaches below. We will implement our new testing strategy by taking inspiration from TorchAudio and TorchVision.

Possible Approaches

Download and cache the dataset daily before running tests (current state of testing)
Create mock data for each dataset (used by torchaudio and torchvision)
- Requires us to understand the structure of the datasets before creating tests
Store a small portion of the dataset (10 lines) in an assets folder
- Might run into legal problems since we aren’t allowed to host datasets

TorchAudio Approach

Each test lives in its own file
Plan to add integration tests in the future to check dataset URLs
Each test class extends TestBaseMixin and PytorchTestCase (link)
- The TestBaseMixin base class provides a consistent way to define a device/dtype/backend-aware TestCase
Each test file contains a get_mock_dataset() method which is responsible for creating the mocked data and saving it to a file in a temp dir (link)
- This method gets called in the setUp classmethod within each test class
The actual test method creates a dataset from the mocked dataset file and then tests the dataset
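
The pattern described above can be sketched as follows. This is a simplified illustration, not TorchAudio's code: plain `unittest.TestCase` stands in for TestBaseMixin and PytorchTestCase, `ToyDataset` is a hypothetical dataset, and the class-level hook is assumed to be unittest's `setUpClass`.

```python
import os
import tempfile
import unittest


def get_mock_dataset(root_dir):
    """Create mocked data under root_dir and return the samples a correct loader should yield."""
    expected = []
    for i in range(3):
        name = f"utt{i}.txt"
        text = f"transcript {i}"
        with open(os.path.join(root_dir, name), "w", encoding="utf-8") as f:
            f.write(text)
        expected.append((name, text))
    return expected


class ToyDataset:
    """Hypothetical dataset under test: reads every .txt file in a directory, sorted by name."""

    def __init__(self, root_dir):
        self._root = root_dir
        self._names = sorted(n for n in os.listdir(root_dir) if n.endswith(".txt"))

    def __len__(self):
        return len(self._names)

    def __getitem__(self, index):
        name = self._names[index]
        with open(os.path.join(self._root, name), encoding="utf-8") as f:
            return name, f.read()


class TestToyDataset(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Mocked data is generated once per test class in a temporary directory.
        cls._tmp = tempfile.TemporaryDirectory()
        cls.root_dir = cls._tmp.name
        cls.expected = get_mock_dataset(cls.root_dir)

    @classmethod
    def tearDownClass(cls):
        cls._tmp.cleanup()

    def test_samples(self):
        # Build the dataset from the mocked files and compare sample by sample.
        dataset = ToyDataset(self.root_dir)
        self.assertEqual(len(dataset), len(self.expected))
        for i, sample in enumerate(self.expected):
            self.assertEqual(dataset[i], sample)


def run_toy_dataset_tests():
    """Run the test case programmatically and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestToyDataset)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Keeping the mock-data generator next to the test that consumes it means the expected samples and the files on disk are produced by the same code, so they cannot drift apart.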

TorchVision Approach

All tests live in the test_datasets.py file. This file is really long (around 1300 lines) and a little hard to read compared to separating the tests for each dataset into its own file
Testing whether dataset URLs are available and download correctly (link)
CocoDetectionTestCase for the COCO dataset extends the DatasetTestCase base class (link)
- DatasetTestCase is the abstract base class for all dataset test cases and expects child classes to overwrite class attributes such as DATASET_CLASS and FEATURE_TYPES (link)
Here are all the tests from DatasetTestCase that get run for each dataset (link)
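
A stripped-down sketch of the base-class pattern is shown below. Only the attribute names DATASET_CLASS and FEATURE_TYPES are taken from the description above; the base class, the `inject_fake_data` hook as written here, the two shared tests, and `LabeledTextDataset` are all assumptions made for illustration and do not reproduce TorchVision's implementation.

```python
import os
import tempfile
import unittest


class ToyDatasetTestCase(unittest.TestCase):
    """Hypothetical base class: subclasses set the class attributes and provide fake data."""

    DATASET_CLASS = None
    FEATURE_TYPES = None

    def inject_fake_data(self, root):
        """Write fake data into root and return the number of examples created."""
        raise NotImplementedError

    def setUp(self):
        if self.DATASET_CLASS is None:
            self.skipTest("abstract base class, nothing to test")
        self._tmp = tempfile.TemporaryDirectory()
        self.addCleanup(self._tmp.cleanup)
        self.num_examples = self.inject_fake_data(self._tmp.name)
        self.dataset = self.DATASET_CLASS(self._tmp.name)

    # The tests below are inherited by, and run for, every concrete subclass.
    def test_num_examples(self):
        self.assertEqual(len(self.dataset), self.num_examples)

    def test_feature_types(self):
        sample = self.dataset[0]
        self.assertEqual(len(sample), len(self.FEATURE_TYPES))
        for feature, expected_type in zip(sample, self.FEATURE_TYPES):
            self.assertIsInstance(feature, expected_type)


class LabeledTextDataset:
    """Hypothetical dataset: each line of data.txt is '<label> <text>'."""

    def __init__(self, root):
        self._samples = []
        with open(os.path.join(root, "data.txt"), encoding="utf-8") as f:
            for line in f:
                label, text = line.rstrip("\n").split(" ", 1)
                self._samples.append((int(label), text))

    def __len__(self):
        return len(self._samples)

    def __getitem__(self, index):
        return self._samples[index]


class LabeledTextDatasetTestCase(ToyDatasetTestCase):
    DATASET_CLASS = LabeledTextDataset
    FEATURE_TYPES = (int, str)

    def inject_fake_data(self, root):
        lines = ["0 first sample", "1 second sample"]
        with open(os.path.join(root, "data.txt"), "w", encoding="utf-8") as f:
            f.write("\n".join(lines) + "\n")
        return len(lines)


def run_labeled_text_tests():
    """Run the concrete test case programmatically and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(LabeledTextDatasetTestCase)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

With this layout, adding a dataset test amounts to declaring two class attributes and one fake-data method, at the cost of the shared checks being less visible than in the one-file-per-dataset style.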

Issue Analytics

Created 2 years ago
Reactions: 3
Comments: 9 (9 by maintainers)

Top GitHub Comments

4 reactions

Nayef211 commented, Mar 9, 2022

Thanks @parmeet, @abhinavarora, @erip, and @VirgileHlav for all your help with designing, implementing, and iterating on the mock dataset tests. I’m going to go ahead and close this now that all tasks within the backlog are complete!

3 reactions

erip commented, Jan 30, 2022

I can pick up the IWSLTs
