[RFC] A common dataset root

Currently, all datasets have a mandatory root parameter, indicating where the dataset will be or has been downloaded.

It would be more convenient if users didn’t need to pass the root and could just rely on some predefined default behaviour. Also, having a default for all datasets would allow places with no internet access (looking at you fbcode 👀) to dump all datasets once and for all at the root, and have seamless access to them afterwards.

Note that for downloading model weights, e.g. using fasterrcnn_resnet50_fpn(pretrained=True), we internally rely on load_state_dict_from_url(), which downloads the weights into the directory that torch.hub.get_dir() returns (by default, this is $TORCH_HOME/hub/).

(BTW, where the models are downloaded doesn’t seem to be configurable on the torchvision side, but that’s another story)

In scikit-learn, a similar logic is used: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.get_data_home.html

Solution

As previously discussed a bit with @pmeier and @datumbox, we could do something similar for torchvision datasets:

  • Introduce torchvision.datasets.getdir() and torchvision.datasets.setdir(). By default getdir() would return $TORCH_HOME/torchvision_datasets/, which is consistent with $TORCH_HOME/hub/.
  • Introduce a default for all root parameters, where the default is what torchvision.datasets.getdir() returns
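
The two bullets above could be sketched roughly as follows. This is only an illustration under assumptions, not torchvision code: getdir() and setdir() are the names proposed in this issue and do not exist in torchvision.datasets, ToyDataset is a hypothetical stand-in for a real dataset class, and the fallback chain for $TORCH_HOME ($XDG_CACHE_HOME/torch, then ~/.cache/torch) is assumed to mirror the one torch.hub documents.

```python
import os

# Hypothetical module-level override; None means "use the default location".
_datasets_dir = None


def _torch_home():
    # Assumed lookup order: $TORCH_HOME, else $XDG_CACHE_HOME/torch,
    # else ~/.cache/torch.
    cache_home = os.getenv("XDG_CACHE_HOME", "~/.cache")
    default_home = os.path.join(cache_home, "torch")
    return os.path.expanduser(os.getenv("TORCH_HOME", default_home))


def getdir():
    """Return the common dataset root (proposed name, not a real API)."""
    if _datasets_dir is not None:
        return _datasets_dir
    return os.path.join(_torch_home(), "torchvision_datasets")


def setdir(path):
    """Override the common dataset root (proposed name, not a real API)."""
    global _datasets_dir
    _datasets_dir = os.path.expanduser(path)


class ToyDataset:
    """Hypothetical dataset whose root falls back to getdir()."""

    def __init__(self, root=None):
        # Resolve the default at call time so a later setdir() is honoured.
        self.root = getdir() if root is None else root
```

Resolving the default inside __init__ rather than in the signature matters here: a default written as root=getdir() would be evaluated once at import time and would ignore later calls to setdir().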

Problem

(Yes, here the problem comes after the solution 😃)

For most datasets this should work OK. For a few of them (namely phototour, UCF101, Kinetics-400, HMDB51, Flickr, EMNIST, COCO), the root parameter is followed by other parameters without a default, so we can’t introduce a default for root without changing its place, which would break backward compatibility.

The easiest workaround here would be to introduce defaults for these other parameters. Otherwise, things get tricky.
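
To make the constraint concrete, here is a small sketch of why root cannot simply get a default, and of the workaround of giving the following parameters a default too. ToyPhotoTour, the _MISSING sentinel and the "/default/datasets" path are hypothetical and only loosely modelled on a dataset that takes (root, name, ...); none of this is the actual torchvision signature.

```python
def is_valid_signature(source):
    """Return True if Python accepts the given function definition."""
    try:
        compile(source, "<sketch>", "exec")
        return True
    except SyntaxError:
        return False


# Sentinel that marks "argument not passed" for a parameter that is
# still logically required.
_MISSING = object()


class ToyPhotoTour:
    """Hypothetical dataset with a required parameter after root."""

    def __init__(self, root=None, name=_MISSING, train=True):
        # name has a default only to satisfy the syntax rule; it is
        # still enforced as required at runtime.
        if name is _MISSING:
            raise TypeError("ToyPhotoTour() missing required argument: 'name'")
        # "/default/datasets" is a placeholder for the common root.
        self.root = "/default/datasets" if root is None else root
        self.name = name
        self.train = train
```

With this shape, existing positional calls such as ToyPhotoTour("/data", "notredame") keep working, while ToyPhotoTour(name="notredame") picks up the default root. The cost is that the required parameter no longer looks required in the signature.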

Other considerations

Currently, datasets are inconsistent with respect to how they treat the root: some will dump their data in root/TheDatasetName like MNIST, but some will dump their data directly in root like Places365.

While unlikely, this can create conflicts between datasets if they use the same file names. Perhaps it would be safe here to “fix” the datasets like Places365 so that they all use a root/TheDatasetName directory. This will create a minor inconvenience of re-downloading the dataset for some users, but it’s probably for the best?
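
A minimal sketch of the root/TheDatasetName convention, assuming the directory name is derived from the class name. The classes here are hypothetical stand-ins, not the real MNIST or Places365 implementations.

```python
import os


class ToyDataset:
    """Hypothetical base class that nests data under root/<ClassName>."""

    def __init__(self, root):
        # Each dataset gets its own subdirectory, so identical file
        # names from two datasets cannot collide.
        self.data_dir = os.path.join(root, type(self).__name__)


class ToyMNIST(ToyDataset):
    pass


class ToyPlaces365(ToyDataset):
    pass
```

Both ToyMNIST("/data") and ToyPlaces365("/data") share the same root but resolve to different directories.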

CC @fmassa @datumbox @pmeier @prabhat00155 @parmeet


Issue Analytics

  • State: open
  • Created 2 years ago
  • Reactions: 6
  • Comments:12 (5 by maintainers)

Top GitHub Comments

3 reactions
pmeier commented, May 5, 2021

I think the common dataset root is a really good idea and we should go for it. I know the current state makes the datasets harder to maintain, but this does not solve any urgent issue. Thus, I suggest we hold off on the change until the new datapipe functionality lands, which will already break torchvision.datasets in several other places. One more BC-breaking change won’t make a difference.

2 reactions
mthrok commented, May 7, 2021

  • $TORCH_HOME/torchvision_datasets/, which is consistent with $TORCH_HOME/hub/

This does not look consistent to me. Considering that model parameter files could be placed in a similar fashion, $TORCH_HOME/vision/datasets|models/ would be cleaner.
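
One possible reading of that layout, as a sketch: vision_dir() is a hypothetical helper, and the "datasets" and "models" subdirectory names are taken from the comment above.

```python
import os


def vision_dir(torch_home, kind):
    """Return <torch_home>/vision/<kind> for the suggested layout."""
    if kind not in ("datasets", "models"):
        raise ValueError("kind must be 'datasets' or 'models'")
    return os.path.join(torch_home, "vision", kind)
```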
