
Why do datasets not predefine a specific train/val/test split?

See original GitHub issue

❓ Questions & Help

I am new to this repo (what an incredible effort! thank you for everyone’s work!) and am trying to think through the logic and structure of how the datasets are coded up. I read the documentation on how to create my own dataset, studied the most important classes (such as Dataset and InMemoryDataset) in detail, and went through several example usages of datasets.

My question is: why do the datasets not (or at least not always) specify a train/val/test split? Such a split would always be useful when directly comparing two models’ performance: you want to train, validate, and test on exactly the same split. (One could argue that the validation set is user- rather than pre-defined and is taken from the training data, but even then, at least the train/test split should be predefined.) I would consider this feature important for proper usage of the datasets as fair benchmarks to compare against. Are there any other efforts in this repo towards this that I am not aware of?

Thinking about how this could roughly be implemented, it would probably require a dedicated “benchmark” class for every dataset, holding three Dataset objects (train/val/test) as attributes; see the sketch below.
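A rough sketch of what such a class could look like (purely hypothetical, not something that exists in the repo; it only assumes the Dataset base class):

from dataclasses import dataclass
from torch_geometric.data import Dataset

@dataclass
class Benchmark:  # hypothetical name
    # Three dataset instances that share one fixed, predefined split.
    train: Dataset
    val: Dataset
    test: Dataset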

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

1 reaction
rusty1s commented, Jul 21, 2020

That’s true. There is some internal state used in the download and process methods, e.g., the root folder, which currently prevents them from being class methods.
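For illustration, a minimal sketch of what that instance state looks like (attribute names follow the Dataset base class; the URL and the snippet itself are hypothetical, not library code):

from torch_geometric.data import download_url

def download(self):
    # self.raw_dir is derived from self.root, which is per-instance state,
    # so download (and process) cannot simply become a @classmethod.
    download_url('https://example.com/data.zip', self.raw_dir)  # hypothetical URL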

1 reaction
rusty1s commented, Jul 19, 2020

Yes, this is in line with what torchvision does for its datasets. In the end, this is not a restriction: you can still implement download and process individually for each split, e.g.:

@property
def processed_file_names(self):
    return f'{self.split}.pt'

def process(self):
    # Only needs to process the data belonging to the current split (self.split).
    ...

To follow up on your question: having multiple dataset instances is convenient for applying different transforms:

train_dataset = MyDataset(train=True, transform=AugmentData())
test_dataset = MyDataset(train=False)
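Putting both snippets together, a minimal sketch of such a split-aware dataset might look as follows (assuming the standard InMemoryDataset API; MyDataset, the split argument, and the dummy data in process are placeholders, and the train=True keyword above would map onto split='train' / split='test'):

import torch
from torch_geometric.data import InMemoryDataset, Data

class MyDataset(InMemoryDataset):
    def __init__(self, root, split='train', transform=None, pre_transform=None):
        assert split in ['train', 'val', 'test']
        self.split = split
        super().__init__(root, transform, pre_transform)
        self.data, self.slices = torch.load(self.processed_paths[0])

    @property
    def raw_file_names(self):
        return []  # this sketch generates data instead of downloading it

    @property
    def processed_file_names(self):
        # One processed file per split, e.g. train.pt / val.pt / test.pt.
        return f'{self.split}.pt'

    def download(self):
        pass  # nothing to download in this sketch

    def process(self):
        # Build the Data objects for the current split only; a single dummy
        # graph stands in for parsing the raw files belonging to self.split.
        data_list = [Data(x=torch.randn(4, 3),
                          edge_index=torch.tensor([[0, 1], [1, 2]]))]
        data, slices = self.collate(data_list)
        torch.save((data, slices), self.processed_paths[0])

The two instances above would then become MyDataset(root, split='train', transform=AugmentData()) and MyDataset(root, split='test').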

Implementing download and process as class methods makes a lot of sense though, will think about it!

Read more comments on GitHub.

