Standardization of the datasets


This is a discussion issue which was kicked off by #1067. Some PRs that contain ideas are #1015 and #1025. I will update this comment regularly with the consensus reached during the discussion.

Disclaimer: I have never worked with segmentation or detection datasets. If I make some wrong assumptions regarding them, feel free to correct me. Furthermore, please help me fill in the gaps.


Proposed Structure

This issue presents the idea of standardizing the torchvision.datasets. This could be done by adding parameters to the VisionDataset (split) or by subclassing it and adding task-specific parameters (classes or class_to_idx) to the new classes. I imagine it something like this:

import torch.utils.data as data

class VisionDataset(data.Dataset):
    pass

class ClassificationDataset(VisionDataset):
    pass

class SegmentationDataset(VisionDataset):
    pass

class DetectionDataset(VisionDataset):
    pass

For our tests we could then have a generic_*_dataset_test, as is already implemented for ClassificationDatasets.
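To make the proposed contract concrete, here is a minimal sketch of what such a generic test could check. The FakeClassificationDataset stand-in and the exact checks are assumptions for illustration; a real test would index actual datasets and assert on PIL.Image instances rather than the placeholder strings used here to keep the sketch dependency-free.

```python
# Hypothetical sketch of a generic test for the proposed ClassificationDataset
# contract: indexing yields an (image, int) pair and every target is a valid
# index into `classes`.

class FakeClassificationDataset:
    classes = ("cat", "dog")

    def __init__(self):
        # images would normally be PIL.Image objects; strings keep the
        # sketch dependency-free
        self._samples = [("img0", 0), ("img1", 1), ("img2", 0)]

    def __getitem__(self, index):
        return self._samples[index]

    def __len__(self):
        return len(self._samples)


def generic_classification_dataset_test(dataset):
    # the contract every ClassificationDataset would have to satisfy
    assert len(dataset) > 0
    for image, target in dataset:
        assert isinstance(target, int)
        assert 0 <= target < len(dataset.classes)


generic_classification_dataset_test(FakeClassificationDataset())
```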


VisionDataset

  • As discussed in #1067, we could unify the argument that selects different parts of the dataset. IMO split as a str is the most general, but still clear, term for this. I would implement this as a positional argument within the constructor. This should work for all datasets, since in order to be useful each dataset should have at least a training and a test split. Exceptions to this are the Fakedata and ImageFolder datasets, which will be discussed separately.

  • IMO every dataset should have a _download method in order to be useful for every user of this package. We could give the constructor a download=True keyword argument and call the download method within it. As above, the Fakedata and ImageFolder datasets will be discussed below.
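The two points above could be combined into a single base constructor. The following is a hedged sketch, not existing torchvision API: the `_SPLITS` attribute and the exact validation message are assumptions, and a real implementation would let each subclass declare its own available splits.

```python
# Sketch of a unified VisionDataset constructor with the proposed `split`
# positional argument and `download` keyword argument. All names here are
# illustrative assumptions, not the actual torchvision implementation.

class VisionDataset:
    # each concrete dataset would override this with its available splits
    _SPLITS = ("train", "test")

    def __init__(self, root, split="train", download=True):
        if split not in self._SPLITS:
            raise ValueError(
                f"split should be one of {self._SPLITS}, but got '{split}'"
            )
        self.root = root
        self.split = split
        if download:
            self.download()

    def download(self):
        # each concrete dataset would implement the actual download logic
        raise NotImplementedError
```

With this in place, every subclass gets split validation and the download hook for free.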


Fakedata and ImageFolder

What makes these two datasets special is that there is nothing to download and they are not split in any way. IMO they are not special enough to warrant deviating from the generalised VisionDataset as stated above. I propose that we simply remove the split and download arguments from their constructors and raise an exception if someone calls the download method.

Furthermore, the Fakedata dataset is currently a ClassificationDataset. We should also create a FakeSegmentationData and a FakeDetectionData dataset.
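The exception-on-download proposal could look like the sketch below. The class body and the RuntimeError message are assumptions for illustration; the real FakeData has more parameters (image size, transforms, etc.) than shown here.

```python
# Illustrative sketch of the proposal for FakeData: no split/download
# constructor arguments, and calling download() fails loudly instead of
# silently doing nothing.

class FakeData:
    def __init__(self, size=1000, num_classes=10):
        self.size = size
        self.num_classes = num_classes

    def download(self):
        raise RuntimeError(
            "FakeData is generated on the fly and cannot be downloaded"
        )
```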


ClassificationDataset

The following datasets belong to this category: CIFAR*, ImageNet, *MNIST, SVHN, LSUN, SEMEION, STL10, USPS, Caltech*

  • Each dataset should return a (PIL.Image, int) pair when indexed
  • Each dataset should have a classes parameter, which is a tuple with all available classes in human-readable form
  • Currently, some datasets have a class_to_idx parameter, which is a dictionary that maps the human-readable class to the index used as its target. I propose to reverse the direction, i.e. create an idx_to_class parameter, since IMO this is the far more common transformation.
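The relationship between the two mappings can be illustrated with a few lines; the class names below are borrowed from CIFAR-10 purely as an example. Deriving both dictionaries from the same classes tuple keeps them trivially consistent.

```python
# Both mappings derived from the proposed `classes` tuple. Mapping a
# predicted index back to a human-readable label (idx_to_class) is the
# direction argued to be more common above.

classes = ("airplane", "automobile", "bird")

class_to_idx = {cls: idx for idx, cls in enumerate(classes)}  # current
idx_to_class = {idx: cls for idx, cls in enumerate(classes)}  # proposed

# typical use: translate a model's predicted index into a label
predicted_index = 2
label = idx_to_class[predicted_index]
```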

SegmentationDataset

The following datasets belong to this category: VOCSegmentation


DetectionDataset

The following datasets belong to this category: CocoDetection, VOCDetection


ToDo

  • The following datasets need sorting into the three categories: ~Caltech101~, ~Caltech256~, CelebA, CityScapes, Cococaptions, Flickr8k, Flickr30k, ~LSUN~, Omniglot, PhotoTour, SBDataset (shouldn’t this be just called SBD?), SBU, ~SEMEION~, ~STL10~, and ~USPS~
  • ~Add some common arguments / parameters for the SegmentationDataset and DetectionDataset~

Thoughts and suggestions?

Issue Analytics

  • State: open
  • Created: 4 years ago
  • Comments: 23 (10 by maintainers)

Top GitHub Comments

6 reactions
wolterlw commented, Feb 20, 2020

@fmassa I’ve mentioned in #963 that using a Dataset subclass to simply index all the files and offloading the rest of the work to Transforms seems reasonably flexible.
Also I want to stress that the problem right now is that most research projects that utilize public datasets write their own wrappers (which vary a lot in readability and speed of execution), as there are no general guidelines on how to build those and there’s no single place to look to see if somebody else has done the work earlier.

It’s unreasonable to spend this much time rewriting pretty much the same functionality. I’ve started a discussion on the PyTorch forum, but it hasn’t really led anywhere yet. I propose creating an addition to Hub, or even a simple GitHub repo, that would house wrappers for research datasets:

.
├── dataset1
│   ├── dataset.py
│   ├── transforms.py
│   ├── viz_tools.py
│   └── README
├── dataset2
│   ├── dataset.py
│   ├── transforms.py
│   ├── viz_tools.py
│   └── README
├── dataset3
│   ├── dataset.py
│   ├── transforms.py
│   ├── viz_tools.py
│   └── README
├── CONTRIBUTING
└── README

Each dataset’s README would tell you:

  1. how to download the dataset
  2. what is the expected directory structure
  3. how annotations are structured and encoded
  4. what visualizations of the available annotations look like

transforms.py should provide dataset-specific transforms, and viz_tools.py functions to visualize the annotations.

The point being, there should be a single place where people can look for dataset wrappers that are ready to use, because in my practice it takes a lot of time to build those yourself.

Let me know what you think.

3 reactions
pmeier commented, Jul 10, 2019

@zhangguanheng66 transforms, as well as the root of the dataset, are already part of the abstract dataset class called VisionDataset. I agree on the other part. We could do something like this:

class VisionDataset(...):
    ...

    def __getitem__(self, index):
        image, target = self._images[index], self._targets[index]
        if self.transform is not None:
            image, target = self.transform(image, target)
        return image, target

    def __len__(self):
        return len(self._images)

    @property
    def _images(self):
        raise NotImplementedError

    @property
    def _targets(self):
        raise NotImplementedError
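A concrete subclass would then only need to fill in the two abstract properties. The sketch below is a hypothetical example, using in-memory lists as a stand-in for real image files, to show that __getitem__ and __len__ come for free from the base class above.

```python
# Minimal base class as proposed above, plus a hypothetical concrete
# subclass. The in-memory lists stand in for actual images and targets.

class VisionDataset:
    def __getitem__(self, index):
        image, target = self._images[index], self._targets[index]
        if self.transform is not None:
            image, target = self.transform(image, target)
        return image, target

    def __len__(self):
        return len(self._images)

    @property
    def _images(self):
        raise NotImplementedError

    @property
    def _targets(self):
        raise NotImplementedError


class InMemoryDataset(VisionDataset):
    def __init__(self, images, targets, transform=None):
        self.transform = transform
        self._data = images
        self._labels = targets

    @property
    def _images(self):
        return self._data

    @property
    def _targets(self):
        return self._labels


dataset = InMemoryDataset(["img_a", "img_b"], [0, 1])
```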
