[RFC] A common dataset root
Currently, all datasets have a mandatory `root` parameter, indicating where the dataset will be or has been downloaded.
It would be more convenient if users didn’t need to pass the root and could just rely on some predefined default behaviour. Also, having a default for all datasets would allow places with no internet access (looking at you fbcode 👀) to dump all datasets once and for all at the root, and have seamless access to them afterwards.
Note that for downloading model weights, e.g. using `fasterrcnn_resnet50_fpn(pretrained=True)`, we internally rely on `load_state_dict_from_url()`, which downloads the weights under the directory that `torch.hub.get_dir()` returns (by default, this is `$TORCH_HOME/hub/`).
(BTW, where the models are downloaded doesn’t seem to be configurable on the torchvision side, but that’s another story)
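For reference, here is a rough stdlib-only sketch of how that default hub location gets resolved, as far as I understand the `torch.hub` behaviour. `torch_home()` and `hub_dir()` are illustrative names made up for this sketch, not torch APIs, and the `XDG_CACHE_HOME` fallback is my reading of the current implementation.

```python
import os


def torch_home() -> str:
    """Resolve the torch cache root (illustrative helper, not a torch API).

    Assumed lookup order: $TORCH_HOME, else $XDG_CACHE_HOME/torch,
    else ~/.cache/torch.
    """
    default = os.path.join(os.getenv("XDG_CACHE_HOME", "~/.cache"), "torch")
    return os.path.expanduser(os.getenv("TORCH_HOME", default))


def hub_dir() -> str:
    """Default hub directory, i.e. $TORCH_HOME/hub (illustrative helper)."""
    return os.path.join(torch_home(), "hub")


if __name__ == "__main__":
    print(hub_dir())
```

In real code you would simply call `torch.hub.get_dir()`; the sketch only spells out where the default comes from.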
In scikit-learn, a similar logic is used: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.get_data_home.html
Solution
As previously discussed a bit with @pmeier and @datumbox, we could do something similar for torchvision datasets:
- Introduce `torchvision.datasets.getdir()` and `torchvision.datasets.setdir()`. By default `getdir()` would return `$TORCH_HOME/torchvision_datasets/`, which is consistent with `$TORCH_HOME/hub/`.
- Introduce a default for all `root` parameters, where the default is what `torchvision.datasets.getdir()` returns.
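A minimal sketch of what the two points above could look like, assuming module-level state for the override. None of this exists in torchvision today: `getdir()` and `setdir()` are the names proposed here, while `_torch_home()` and `ToyDataset` are hypothetical stand-ins for illustration.

```python
import os
from typing import Optional

# Set by setdir(); None means "fall back to the default location".
_datasets_dir: Optional[str] = None


def _torch_home() -> str:
    # Hypothetical helper: $TORCH_HOME, else $XDG_CACHE_HOME/torch, else ~/.cache/torch.
    default = os.path.join(os.getenv("XDG_CACHE_HOME", "~/.cache"), "torch")
    return os.path.expanduser(os.getenv("TORCH_HOME", default))


def setdir(path: str) -> None:
    """Override the common dataset root for the current process."""
    global _datasets_dir
    _datasets_dir = os.path.expanduser(path)


def getdir() -> str:
    """Return the common dataset root, $TORCH_HOME/torchvision_datasets by default."""
    if _datasets_dir is not None:
        return _datasets_dir
    return os.path.join(_torch_home(), "torchvision_datasets")


class ToyDataset:
    """Hypothetical dataset showing the proposed default for root."""

    def __init__(self, root: Optional[str] = None) -> None:
        # Resolve the default at call time so a later setdir() is honoured.
        self.root = getdir() if root is None else root
```

Using `root=None` in the signature instead of `root=getdir()` is deliberate in this sketch: a default evaluated at definition time would ignore any later `setdir()` call.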
Problem
(Yes, here the problem comes after the solution 😃)
For most datasets this should work OK. For a few of them (namely phototour, UCF101, Kinetics-400, HMDB51, Flickr, EMNIST, COCO), the `root` parameter is followed by other parameters without a default, so we can’t introduce a default for `root` without changing its place, which would break backward compatibility.
The easiest workaround here would be to introduce defaults for these other parameters. Otherwise, things get tricky.
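To make the constraint concrete, here is a small sketch. `ToyVideoDataset` and `annotation_path` are hypothetical stand-ins for one of the affected datasets, not the real signatures, and rejecting the placeholder default with a `TypeError` is just one possible way to keep the parameter effectively required.

```python
from typing import Optional

# A parameter with a default cannot precede one without a default,
# so this signature does not even compile.
BROKEN = "def __init__(self, root=None, annotation_path): pass"


def is_valid_signature(source: str) -> bool:
    """Return True if the given function definition compiles."""
    try:
        compile(source, "<sketch>", "exec")
        return True
    except SyntaxError:
        return False


class ToyVideoDataset:
    """Hypothetical dataset whose required parameter follows root."""

    def __init__(
        self,
        root: Optional[str] = None,
        annotation_path: Optional[str] = None,
    ) -> None:
        # annotation_path only gets a placeholder default so that root can
        # have one; it is still required, so reject the placeholder.
        if annotation_path is None:
            raise TypeError("annotation_path is required")
        self.root = root
        self.annotation_path = annotation_path
```

Existing positional calls such as `ToyVideoDataset("/data", "ann.txt")` keep working, and new code can write `ToyVideoDataset(annotation_path="ann.txt")`.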
Other considerations
Currently, datasets are inconsistent with respect to how they treat the `root`: some will dump their data in `root/TheDatasetName` like `MNIST`, but some will dump their data directly in `root` like `Places365`.
While unlikely, this can create conflicts between datasets if they use the same file names. Perhaps it would be safer here to “fix” the datasets like `Places365` so that they all use a `root/TheDatasetName` directory. This will create a minor inconvenience of re-downloading the dataset for some users, but it’s probably for the best?
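A sketch of the uniform layout, assuming the subdirectory is simply the dataset class name. `dataset_dir()` is a hypothetical helper and the two classes are empty stand-ins for the real datasets.

```python
from pathlib import Path


def dataset_dir(root: str, dataset_cls: type) -> Path:
    """Return root/TheDatasetName for the given dataset class (hypothetical helper)."""
    return Path(root) / dataset_cls.__name__


class MNIST:
    """Empty stand-in for the real dataset class."""


class Places365:
    """Empty stand-in for the real dataset class."""


if __name__ == "__main__":
    # Each dataset gets its own subdirectory, so file names cannot collide.
    print(dataset_dir("/data", MNIST))
    print(dataset_dir("/data", Places365))
```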
CC @fmassa @datumbox @pmeier @prabhat00155 @parmeet
I think the common dataset root is a really good idea and we should go for it. I know the current state makes it harder to maintain, but there are no urgent issues solved by this. Thus, I suggest we hold off on the change until the new datapipe functionality lands, which will break `torchvision.datasets` in multiple other places. One more BC-breaking change won’t make a difference.

This does not look consistent to me. Considering the possibility of placing model parameter files in a similar fashion, `$TORCH_HOME/vision/datasets|models/` would be cleaner.