Why do datasets not predefine a specific train/val/test split?
❓ Questions & Help
I am new to this repo (what an incredible effort! Thank you for everyone’s work!) and am trying to think through, conceptually, the logic and structure of how datasets are coded up. I read the documentation on how to create my own dataset, read the most important classes, such as `Dataset` and `InMemoryDataset`, in detail, and also went through several example usages of datasets.
My question is: why do the datasets not (or at least not always) specify a train/val/test split? I would always find such a split useful when directly comparing two models’ performance: you want to train, validate, and test on exactly the same split. (One could argue that the validation set is user-defined rather than predefined, and taken from the training set, but even then at least the train/test split should be predefined.) I consider this feature important for using the datasets properly as fair benchmarks to compare against. Are there any other efforts in this repo towards this that I am not aware of?
Thinking about how this could be roughly implemented, it would probably require a dedicated “benchmark” class for every “dataset” that contains `Dataset` objects (3 of them) as attributes.
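A rough sketch of that idea, under loud assumptions: plain lists stand in for the three `Dataset` objects, and `Benchmark` and `make_benchmark` are hypothetical names, not anything the library provides. The point is only that a fixed seed makes the split reproducible, so two models can be compared on identical train/val/test partitions.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Benchmark:
    """Hypothetical container holding the three splits as attributes."""
    train: List[int]
    val: List[int]
    test: List[int]


def make_benchmark(items: Sequence[int], seed: int = 0,
                   n_val: int = 10, n_test: int = 10) -> Benchmark:
    """Shuffle with a private, seeded RNG so the split is the same on every call."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return Benchmark(train=train, val=val, test=test)


# Same seed, same split: both calls yield identical partitions.
a = make_benchmark(range(100), seed=42)
b = make_benchmark(range(100), seed=42)
```

In a real setting the lists would be replaced by dataset objects (or index lists into one dataset), and the seed or the index lists would be shipped with the benchmark.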
Top GitHub Comments
That’s true. There is some internal state used in the `download` and `process` methods, e.g., the root folder, which currently prevents them from being class methods.
Yes, this is in line with what torchvision is doing for its datasets. In the end, this is no restriction. You can just as well implement `download` and `process` individually for each split, e.g.:
To follow up on your question, having multiple dataset instances is convenient for applying different transforms:
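A minimal sketch of both points (per-split `download`/`process`, and a separate transform per split instance). This is not the code from the original comment and it does not use the library's API: `ToySplitDataset`, its `split` argument, and the toy numbers are all hypothetical, chosen so the example runs with the standard library alone.

```python
import json
import os
import tempfile


class ToySplitDataset:
    """Hypothetical stand-in for a dataset class; not the library's API."""

    def __init__(self, root, split="train", transform=None):
        assert split in ("train", "val", "test")
        self.root = root            # instance state that download/process rely on
        self.split = split
        self.transform = transform
        if not os.path.exists(self.raw_path):
            self.download()
        if not os.path.exists(self.processed_path):
            self.process()
        with open(self.processed_path) as f:
            self.data = json.load(f)

    @property
    def raw_path(self):
        return os.path.join(self.root, "raw", f"{self.split}.json")

    @property
    def processed_path(self):
        return os.path.join(self.root, "processed", f"{self.split}.json")

    def download(self):
        # Placeholder for fetching only this split's raw file; writes toy numbers.
        os.makedirs(os.path.dirname(self.raw_path), exist_ok=True)
        toy = {"train": [1, 2, 3, 4], "val": [5, 6], "test": [7, 8]}
        with open(self.raw_path, "w") as f:
            json.dump(toy[self.split], f)

    def process(self):
        # Convert this split's raw file into its own processed file.
        os.makedirs(os.path.dirname(self.processed_path), exist_ok=True)
        with open(self.raw_path) as f:
            raw = json.load(f)
        with open(self.processed_path, "w") as f:
            json.dump([float(x) for x in raw], f)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]
        return self.transform(item) if self.transform else item


root = tempfile.mkdtemp()
# One instance per split, each with its own transform (e.g. augmentation for training only).
train_dataset = ToySplitDataset(root, split="train", transform=lambda x: x + 0.5)
test_dataset = ToySplitDataset(root, split="test")
```

Because each split is its own instance, the training transform never leaks into evaluation, and each split only downloads and processes the files it needs.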
Implementing `download` and `process` as class methods makes a lot of sense though, will think about it!