Why do datasets not predefine a specific train/val/test split?

❓ Questions & Help

I am new to this repo (what an incredible effort! thank you for everyone’s work!) and am trying to understand the logic and structure of how datasets are coded up. I read the documentation on how to create my own dataset, read the most important classes in detail, such as Dataset and InMemoryDataset, and also went through several example usages of datasets.

My question is: Why do the datasets not (or at least not always) specify a train/val/test split? I would always find this split useful when directly comparing two models’ performance: you want to use the exact same split to train, validate, and test on (one could argue that the validation set is user-defined rather than predefined and taken from the training set, but even then, at least the train/test split should be predefined). I consider this feature important for using the datasets as fair benchmarks to compare against. Are there any other efforts in this repo towards this that I am not aware of?

Thinking about how this could be roughly implemented, it would probably require a dedicated “benchmark” class for every “dataset” that contains Dataset objects (3 of them) as attributes.
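The “benchmark” idea above could be sketched roughly as follows. This is a minimal, framework-free illustration and not part of PyTorch Geometric: the Benchmark class, its from_seed constructor, and the default fractions are all hypothetical names and values chosen for the example. It only shows that a seeded shuffle gives every user the same three splits.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Benchmark:
    """Hypothetical container bundling three fixed splits of one dataset."""
    train: List
    val: List
    test: List

    @classmethod
    def from_seed(cls, examples: Sequence, val_fraction: float = 0.1,
                  test_fraction: float = 0.1, seed: int = 0) -> "Benchmark":
        # Shuffle indices with a private, seeded RNG so the split is
        # identical on every run and independent of global random state.
        indices = list(range(len(examples)))
        random.Random(seed).shuffle(indices)

        num_test = int(len(indices) * test_fraction)
        num_val = int(len(indices) * val_fraction)

        test_idx = indices[:num_test]
        val_idx = indices[num_test:num_test + num_val]
        train_idx = indices[num_test + num_val:]

        return cls(
            train=[examples[i] for i in train_idx],
            val=[examples[i] for i in val_idx],
            test=[examples[i] for i in test_idx],
        )


examples = list(range(100))  # stand-in for real data objects
benchmark = Benchmark.from_seed(examples, seed=42)
same_benchmark = Benchmark.from_seed(examples, seed=42)
```

Two models evaluated on benchmark.test then see exactly the same examples, provided both use the same seed and the same ordering of the underlying data.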

Issue Analytics

State:
Created 3 years ago
Comments: 7 (3 by maintainers)

Top GitHub Comments

rusty1s commented, Jul 21, 2020

That’s true. There is some internal state used in the download and process methods, e.g., the root folder, which currently prevents it from being a class method.

rusty1s commented, Jul 19, 2020

Yes, this is in line with what torchvision is doing for its datasets. In the end, this is no restriction. You can just as well implement download and process individually for each split, e.g.:

@property
def processed_file_names(self):
    # `self.split` is assumed to be set in __init__ (e.g., 'train' or 'test').
    return f'{self.split}.pt'

def process(self):
    # Only needs to implement process for current split.
    ...

To follow up on your question, having multiple data instances is convenient for applying different transforms:

train_dataset = MyDataset(train=True, transform=AugmentData())
test_dataset = MyDataset(train=False)

Implementing download and process as class methods makes a lot of sense though, will think about it!
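The per-split pattern described in this comment might look roughly like the sketch below. It deliberately avoids torch_geometric so that it runs on its own: SplitDataset, its split argument, and the toy _RAW data are hypothetical stand-ins, not the library's API. It shows one instance per split, a per-split processed file name, and a transform attached only to the training instance.

```python
from typing import Callable, List, Optional


class SplitDataset:
    """Hypothetical, framework-free stand-in for a dataset with a `split` argument."""

    # Toy raw data; a real dataset would download or read this from disk.
    _RAW = {"train": [1, 2, 3, 4], "val": [5, 6], "test": [7, 8]}

    def __init__(self, split: str = "train",
                 transform: Optional[Callable[[int], int]] = None):
        if split not in self._RAW:
            raise ValueError(f"unknown split: {split!r}")
        self.split = split
        self.transform = transform
        self.data = self.process()

    @property
    def processed_file_names(self) -> str:
        # One processed file per split.
        return f"{self.split}.pt"

    def process(self) -> List[int]:
        # Only the current split is processed.
        return list(self._RAW[self.split])

    def __len__(self) -> int:
        return len(self.data)

    def __getitem__(self, index: int) -> int:
        item = self.data[index]
        # The transform is applied lazily, at access time.
        return self.transform(item) if self.transform is not None else item


# Augment only the training split; the test split stays untouched.
train_dataset = SplitDataset(split="train", transform=lambda x: x * 10)
test_dataset = SplitDataset(split="test")
```

Because each split is its own object, the training transform never leaks into evaluation, which is the convenience the comment points to.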
