Standardization of the datasets
This is a discussion issue which was kicked off by #1067. Some PRs that contain ideas are #1015 and #1025. I will update this comment regularly with the consensus achieved during the discussion.
Disclaimer: I have never worked with segmentation or detection datasets. If I make some wrong assumptions regarding them, feel free to correct me. Furthermore, please help me to fill in the gaps.
## Proposed Structure
This issue presents the idea to standardize the `torchvision.datasets`. This could be done by adding parameters to the `VisionDataset` (`split`) or by subclassing it and adding task-specific parameters (`classes` or `class_to_idx`) to the new classes. I imagine it something like this:
```python
import torch.utils.data as data


class VisionDataset(data.Dataset):
    pass


class ClassificationDataset(VisionDataset):
    pass


class SegmentationDataset(VisionDataset):
    pass


class DetectionDataset(VisionDataset):
    pass
```
For our tests we could then have a `generic_*_dataset_test` as is already implemented for `ClassificationDataset`s.
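As a sketch of what such a generic test helper could look like (the helper name and the `FakeClassificationDataset` stand-in below are illustrative, not actual torchvision code):

```python
import PIL.Image


def generic_classification_dataset_test(dataset, num_samples):
    # Every classification dataset should yield (PIL.Image, int) pairs.
    assert len(dataset) == num_samples
    for image, target in dataset:
        assert isinstance(image, PIL.Image.Image)
        assert isinstance(target, int)


class FakeClassificationDataset:
    """Minimal stand-in dataset used only to demonstrate the generic test."""

    def __init__(self, num_samples=3):
        self.num_samples = num_samples

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        if not 0 <= idx < self.num_samples:
            raise IndexError(idx)  # lets plain iteration terminate
        return PIL.Image.new("RGB", (32, 32)), idx % 10


generic_classification_dataset_test(FakeClassificationDataset(), num_samples=3)
```

The same helper could then be reused verbatim for every dataset in the category.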
## `VisionDataset`

- As discussed in #1067 we could unify the argument that selects different parts of the dataset. IMO `split` as a `str` is the most general, but still clear term for this. I would implement this as a positional argument within the constructor. This should work for all datasets, since in order to be useful each dataset should have at least a training and a test split. Exceptions to this are the `Fakedata` and `ImageFolder` datasets, which will be discussed separately.
- IMO every dataset should have a `download` method in order to be useful for every user of this package. We could have the constructor take `download=True` as a keyword argument and call the `download` method within it. As above, the `Fakedata` and `ImageFolder` datasets will be discussed below.
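Combining the two points, the base class constructor could look roughly like this (the split names and class body are assumptions for illustration, not torchvision's implementation):

```python
class VisionDataset:
    """Sketch of the proposed base class. In torchvision this would
    subclass torch.utils.data.Dataset; omitted here to stay self-contained."""

    # Assumed split names; concrete datasets would override this.
    _SPLITS = ("train", "test")

    def __init__(self, root, split, download=False):
        if split not in self._SPLITS:
            raise ValueError(f"split should be one of {self._SPLITS}, got {split!r}")
        self.root = root
        self.split = split
        if download:
            self.download()

    def download(self):
        # Each concrete dataset would fetch and extract its files here.
        raise NotImplementedError
```

Concrete datasets would then only need to implement `download` and the indexing logic.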
## `Fakedata` and `ImageFolder`

What makes these two datasets special is that there is nothing to download and they are not split in any way. IMO they are not special enough to not generalise the `VisionDataset` as stated above. I propose that we simply remove the `split` and `download` arguments from their constructors and raise an exception if someone calls the `download` method.
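A minimal sketch of that behaviour (the class body and the choice of `RuntimeError` are illustrative assumptions):

```python
class VisionDataset:
    """Minimal base for the sketch; see the proposal above."""

    def download(self):
        raise NotImplementedError


class FakeData(VisionDataset):
    """Sketch: no `split` or `download` constructor arguments,
    since the data is generated on the fly."""

    def __init__(self, size=1000, num_classes=10):
        self.size = size
        self.num_classes = num_classes

    def download(self):
        # Nothing to download for generated data; fail loudly
        # instead of silently doing nothing.
        raise RuntimeError("FakeData is generated on the fly and cannot be downloaded")
```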
Furthermore, the `Fakedata` dataset is currently a `ClassificationDataset`. We should also create a `FakeSegmentationData` and a `FakeDetectionData` dataset.
## `ClassificationDataset`

The following datasets belong to this category: `CIFAR*`, `ImageNet`, `*MNIST`, `SVHN`, `LSUN`, `SEMEION`, `STL10`, `USPS`, `Caltech*`

- Each dataset should return `(PIL.Image, int)` if indexed.
- Each dataset should have a `classes` parameter, which is a `tuple` with all available classes in human-readable form.
- Currently, some datasets have a `class_to_idx` parameter, which is a dictionary that maps the human-readable class to its index used as target. I propose to change the direction, i.e. create an `idx_to_class` parameter, since IMO this is the far more common transformation.
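To illustrate the proposed direction change (the class names are made up for the example):

```python
# Example human-readable classes, as they would appear in `classes`.
classes = ("airplane", "automobile", "bird")

# Current scheme: map class name -> target index.
class_to_idx = {cls: idx for idx, cls in enumerate(classes)}

# Proposed scheme: map target index -> class name. Since `classes` is an
# ordered tuple, `classes[idx]` already provides this lookup; an explicit
# dict just makes the common direction first-class.
idx_to_class = {idx: cls for idx, cls in enumerate(classes)}

assert class_to_idx["bird"] == 2
assert idx_to_class[2] == "bird" == classes[2]
```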
## `SegmentationDataset`

The following datasets belong to this category: `VOCSegmentation`
## `DetectionDataset`

The following datasets belong to this category: `CocoDetection`, `VOCDetection`
## ToDo

- The following datasets need sorting into the three categories: ~~`Caltech101`~~, ~~`Caltech256`~~, `CelebA`, `CityScapes`, `Cococaptions`, `Flickr8k`, `Flickr30k`, ~~`LSUN`~~, `Omniglot`, `PhotoTour`, `SBDataset` (shouldn’t this be just called `SBD`?), `SBU`, ~~`SEMEION`~~, ~~`STL10`~~, and ~~`USPS`~~
- ~~Add some common arguments / parameters for the `SegmentationDataset` and `DetectionDataset`~~
Thoughts and suggestions?
*Created 4 years ago · 23 comments (10 by maintainers)*
@fmassa I’ve mentioned in #963 that using a Dataset subclass to simply index all the files and offload the rest of the work to Transforms seems to be reasonably flexible.
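A sketch of that "index the files, offload the rest to transforms" idea (the class and its arguments are hypothetical, not from #963):

```python
import pathlib


class ImagePathsDataset:
    """Sketch of a thin wrapper: the dataset only indexes files on disk,
    all decoding and augmentation is delegated to a transform callable."""

    def __init__(self, root, transform=None, extension=".png"):
        self.paths = sorted(pathlib.Path(root).glob(f"*{extension}"))
        self.transform = transform

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        path = self.paths[idx]
        if self.transform is not None:
            # e.g. a transform that opens the image and augments it
            return self.transform(path)
        return path
```

Swapping the transform then changes what the dataset yields without touching the indexing logic.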
Also I want to stress that the problem right now is that most research projects that utilize public datasets write their own wrappers (which vary a lot in readability and speed of execution) as there are no general guidelines on how to build those and there’s no single place to look if somebody else has done the work earlier.
It’s unreasonable to spend this much time rewriting pretty much the same functionality. I’ve started a discussion on the PyTorch forum, but it hasn’t really led anywhere yet. I propose to create an addition to Hub, or even a simple GitHub repo, that would house wrappers for research datasets.
Each dataset’s README would tell you: `transforms.py` should provide dataset-specific transforms, and `viz_tools.py` functions to visualize the annotations. Point being, there should be a single place where people can look for dataset wrappers that are ready to be used, because in my practice it takes a lot of time to build those for yourself.

Let me know what you think.
@zhangguanheng66 `transform`s as well as the `root` of the dataset are already part of the abstract dataset class called `VisionDataset`. I agree on the other part. We could do something like this:
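One possible sketch of such a shared abstract class holding `root` and the `transform`s (illustrative only; it omits the `torch.utils.data.Dataset` base class to stay self-contained, and is not torchvision's actual implementation):

```python
class VisionDataset:
    """Sketch: holds the arguments common to all vision datasets and
    leaves the indexing logic to concrete subclasses."""

    def __init__(self, root, transform=None, target_transform=None):
        self.root = root
        self.transform = transform
        self.target_transform = target_transform

    def __getitem__(self, index):
        # Concrete datasets load a sample, apply self.transform to the
        # input and self.target_transform to the target, and return both.
        raise NotImplementedError

    def __len__(self):
        raise NotImplementedError
```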