
Sampling patches from multiple subject datasets


🚀 Feature

Suppose I have two (or more) SubjectsDatasets: dataset A with scans from hospital A and dataset B with scans from hospital B. I want my model to see as many patches from dataset A as from dataset B. However, dataset A contains 500 subjects and dataset B contains only 10.

TensorFlow has a function for this, sample_from_datasets. For this example it would look like this:

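import tensorflow as tf

# weights are per-dataset sampling probabilities, so equal values
# draw from dataset_A and dataset_B equally often
dataset = tf.data.experimental.sample_from_datasets(
    [dataset_A, dataset_B], weights=[0.5, 0.5], seed=None
)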

This new dataset could then be passed to the TorchIO Queue.

In PyTorch this looks a bit different, but the result is the same:

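import numpy as np
from torch.utils.data import ConcatDataset, DataLoader, WeightedRandomSampler

sets = [dataset_A, dataset_B]
dataset = ConcatDataset(sets)

# Per-sample weight: samples from the smaller dataset get the larger weight
dist = np.concatenate([[(len(dataset) - len(s)) / len(dataset)] * len(s) for s in sets])
# One epoch: length of the smallest dataset times the number of datasets
num_samples = min(len(s) for s in sets) * len(sets)
sampler = WeightedRandomSampler(weights=dist, num_samples=num_samples)
dataloader = DataLoader(dataset, sampler=sampler)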
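
The weights above give both datasets the same total weight when there are exactly two of them. With three or more datasets of different sizes, the (len(dataset) - len(s)) / len(dataset) formula no longer balances them in general. A minimal sketch of one possible generalization, assuming the goal is the same expected number of draws from every dataset: give each sample the weight 1 / (number of datasets × size of its dataset). The helper names and the dataset sizes below are hypothetical and only for illustration.

```python
from typing import List


def balanced_weights(sizes: List[int]) -> List[float]:
    """Return one weight per sample so that every dataset has the same total weight."""
    n_sets = len(sizes)
    weights: List[float] = []
    for size in sizes:
        # Each dataset gets total weight 1 / n_sets, split evenly over its samples.
        weights.extend([1.0 / (n_sets * size)] * size)
    return weights


def epoch_length(sizes: List[int]) -> int:
    """Length of the smallest dataset times the number of datasets."""
    return min(sizes) * len(sizes)


sizes = [500, 10, 40]  # hypothetical dataset sizes, in concatenation order
weights = balanced_weights(sizes)
num_samples = epoch_length(sizes)
```

The resulting list is ordered like the concatenated dataset, so it could be passed as the weights argument of WeightedRandomSampler together with num_samples.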

What would be the best way to integrate this feature in TorchIO?

  • Being able to also pass a sampler to the Queue?
  • Having a SampledSubjectsDataset(datasets: List[SubjectsDataset], weights: List[float]) that can be passed to the Queue instead of a SubjectsDataset?
  • How many patients should be sampled in one epoch when the datasets are unbalanced? The length of the smallest dataset times the number of datasets?
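
The second option can be sketched in plain Python. This is only an illustration under stated assumptions: SampledSubjectsDataset does not exist in TorchIO, the constructor arguments length and seed are made up here, and the lists of strings stand in for real SubjectsDataset objects. Whether the Queue would accept such an object unchanged has not been verified.

```python
import random
from typing import Any, List, Optional, Sequence


class SampledSubjectsDataset:
    """Hypothetical wrapper (not part of TorchIO) that samples across datasets."""

    def __init__(
        self,
        datasets: Sequence[Sequence[Any]],
        weights: Sequence[float],
        length: int,
        seed: Optional[int] = None,
    ) -> None:
        if len(datasets) != len(weights):
            raise ValueError("Expected one weight per dataset")
        self.datasets: List[Sequence[Any]] = list(datasets)
        self.weights: List[float] = list(weights)
        self.length = length  # number of subjects in one epoch
        self.rng = random.Random(seed)

    def __len__(self) -> int:
        return self.length

    def __getitem__(self, index: int) -> Any:
        if not 0 <= index < self.length:
            raise IndexError(index)
        # The index only bounds the epoch: each access picks a dataset
        # according to the weights, then a subject uniformly within it.
        dataset = self.rng.choices(self.datasets, weights=self.weights, k=1)[0]
        return dataset[self.rng.randrange(len(dataset))]


# Stand-ins for two subject datasets of very different sizes
dataset_A = [f"A{i}" for i in range(500)]
dataset_B = [f"B{i}" for i in range(10)]
sampled = SampledSubjectsDataset(
    [dataset_A, dataset_B],
    weights=[0.5, 0.5],
    length=min(len(dataset_A), len(dataset_B)) * 2,
    seed=0,
)
```

Because the draw happens inside __getitem__, the same index can return different subjects on different accesses, which is a deliberate simplification in this sketch.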

Motivation

Training on balanced datasets is known to be important for good results.

Alternatives

An alternative is deepcopying dataset B 50 times (before images get loaded) to get it balanced. However, this is not a very nice solution.
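
Oversampling by repetition does not strictly require copies. A minimal sketch, assuming the datasets are map-style (they support len() and integer indexing): a thin view that maps repeated indices back onto the original dataset. RepeatedDataset is a hypothetical name, not a TorchIO or PyTorch class, and the list of strings stands in for a real dataset.

```python
from typing import Any, Sequence


class RepeatedDataset:
    """Hypothetical view presenting `dataset` as `times` back-to-back repetitions."""

    def __init__(self, dataset: Sequence[Any], times: int) -> None:
        if times < 1:
            raise ValueError("times must be at least 1")
        self.dataset = dataset
        self.times = times

    def __len__(self) -> int:
        return len(self.dataset) * self.times

    def __getitem__(self, index: int) -> Any:
        if not 0 <= index < len(self):
            raise IndexError(index)
        # Wrap around instead of storing duplicated entries.
        return self.dataset[index % len(self.dataset)]


dataset_B = [f"B{i}" for i in range(10)]  # stand-in for the small dataset
repeated_B = RepeatedDataset(dataset_B, times=50)
```

Such a view could then be concatenated with dataset A, for example with ConcatDataset, so that both contribute 500 entries per epoch.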

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

fepegar commented, Dec 17, 2020 (1 reaction)

@dmus beat me to it. The thing is that preprocessed images are not stored in the dataset, because datasets are basically lists of paths with methods to load, transform and return the images.

dmus commented, Dec 17, 2020 (1 reaction)

I don't think a deepcopy is needed in this case. It is needed in the __getitem__ method of SubjectsDataset, where a deepcopy is made to avoid preprocessing an already preprocessed subject.
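
The effect described in this comment can be shown with a toy example. This is not TorchIO code: ToyDataset, double_intensity and the dictionary subjects are hypothetical stand-ins, and the transform is assumed to modify its input in place.

```python
import copy
from typing import Callable, Dict, List


def double_intensity(subject: Dict[str, int]) -> Dict[str, int]:
    """Toy in-place transform standing in for preprocessing."""
    subject["intensity"] *= 2
    return subject


class ToyDataset:
    """Toy dataset that optionally copies a subject before transforming it."""

    def __init__(
        self,
        subjects: List[Dict[str, int]],
        transform: Callable[[Dict[str, int]], Dict[str, int]],
        copy_subject: bool = True,
    ) -> None:
        self.subjects = subjects
        self.transform = transform
        self.copy_subject = copy_subject

    def __len__(self) -> int:
        return len(self.subjects)

    def __getitem__(self, index: int) -> Dict[str, int]:
        subject = self.subjects[index]
        if self.copy_subject:
            # Work on a copy so the stored subject stays untouched.
            subject = copy.deepcopy(subject)
        return self.transform(subject)


safe = ToyDataset([{"intensity": 1}], double_intensity, copy_subject=True)
unsafe = ToyDataset([{"intensity": 1}], double_intensity, copy_subject=False)

safe_values = [safe[0]["intensity"], safe[0]["intensity"]]
unsafe_values = [unsafe[0]["intensity"], unsafe[0]["intensity"]]
```

With the copy, both accesses see the original subject and return the same value. Without it, the second access transforms a subject that was already transformed by the first.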
