Feature Request for torchvision ImageFolder using/inheriting DatasetFolder
I came across a feature that other users have also requested on the PyTorch forums. For a detailed description of the problem and how to solve it, see this discussion between other users, @ptrblck, and me: https://discuss.pytorch.org/t/how-to-sample-images-belonging-to-particular-classes/43776/9
In short: ImageFolder, which inherits from DatasetFolder, limits the user to loading the whole dataset from a folder rather than only some of the classes/directories in the folder structure. One can override find_classes() in a custom DatasetFolder subclass, but this option is often hidden from users, since they only use ImageFolder, which uses DatasetFolder under the hood.
For users who get this wrong, see also the forum discussion linked above, where @ptrblck and I concluded that it would be nice to select a subset of a folder structure simply by passing an optional function to ImageFolder.
The line I am referring to in the current torchvision DatasetFolder implementation, where a subset of a folder can be retrieved by overriding this function: https://github.com/pytorch/vision/blob/fba4f42e3bc24b7b2c6cad09b6db653ac73dc6b7/torchvision/datasets/folder.py#L144
My suggestion for this improvement, so that users can use only a subset of a folder structure in ImageFolder, looks as follows (as also stated on the PyTorch forum):
import os
from typing import Dict, List, Tuple

def find_classes(directory: str, desired_class_names: List[str]) -> Tuple[List[str], Dict[str, int]]:
    """Finds the class folders in a dataset, keeping only the desired ones."""
    classes = sorted(entry.name for entry in os.scandir(directory) if entry.is_dir())
    # Proposed change (untested sketch): keep only the requested class folders.
    classes = [name for name in classes if name in desired_class_names]
    if not classes:
        raise FileNotFoundError(f"Couldn't find any class folder in {directory}.")
    class_to_idx = {cls_name: i for i, cls_name in enumerate(classes)}
    return classes, class_to_idx
The current implementation suggests overriding the function in this way within DatasetFolder, but judging from forum posts, most users tend to use ImageFolder.
Also, as stated, @ptrblck suggested making it possible to pass a function to ImageFolder directly instead of overriding DatasetFolder. I have no code to suggest for this, but it might be trivial to implement by just passing parameters.
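To make the "pass a function" idea more concrete, here is a rough sketch of the kind of callable such a parameter could accept. It uses only the standard library. The factory name make_find_classes is hypothetical, and as far as this thread shows, ImageFolder does not currently accept such a function, so this only illustrates the shape of the callable.

```python
import os
from typing import Callable, Dict, Iterable, List, Tuple

# Same signature as find_classes(directory): returns (classes, class_to_idx).
FindClasses = Callable[[str], Tuple[List[str], Dict[str, int]]]


def make_find_classes(desired_class_names: Iterable[str]) -> FindClasses:
    """Build a find_classes-style callable limited to the given folder names.

    Hypothetical helper for illustration; not part of torchvision.
    """
    desired = set(desired_class_names)

    def find_classes(directory: str) -> Tuple[List[str], Dict[str, int]]:
        # Keep only sub-directories whose name was requested.
        classes = sorted(
            entry.name
            for entry in os.scandir(directory)
            if entry.is_dir() and entry.name in desired
        )
        if not classes:
            raise FileNotFoundError(
                f"Couldn't find any desired class folder in {directory}."
            )
        class_to_idx = {name: i for i, name in enumerate(classes)}
        return classes, class_to_idx

    return find_classes
```

Binding the desired names in a closure keeps the single-argument signature that the existing find_classes(directory) hook expects, so the returned function could be called from an overridden find_classes() method today, or handed to a future optional parameter if one were added.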
cc @pmeier
Issue Analytics
- Created 2 years ago
- Comments:5 (3 by maintainers)
Top GitHub Comments
I think that’s mostly because these discussions predate the 0.10 release, which is only a few months old and in which we made the overriding of find_classes() publicly available. The current docs for ImageFolder are here: https://pytorch.org/vision/stable/datasets.html#torchvision.datasets.ImageFolder
where we say
If you think there’s a more obvious way to expose this, you’re welcome to submit a PR 😃
The actual file you’ll need to edit is https://github.com/pytorch/vision/blob/main/torchvision/datasets/folder.py#L271:L271
and our contributing guide is here 😃 https://github.com/pytorch/vision/blob/main/CONTRIBUTING.md
Reminder to self: add functionality to exclude folders to torchvision.prototype.datasets.from_image_folder.