Need a data sampler for unevenly distributed labels
We need a way to handle a dataset whose label distribution is highly skewed. For example, when we have 1000 positives and 100 negatives, we want to make sure each batch contains the same number of positives and negatives, which means each negative example is drawn about 10 times as often as each positive example.
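To make the request concrete, here is a minimal sketch in plain Python of what "the same number of positives and negatives in each batch" could look like. The function name `balanced_batches` and its parameters are hypothetical, not part of Chainer or PyTorch. The sketch only produces dataset indices, and it assumes the labels fit in memory and that the batch size is divisible by the number of classes.

```python
import random
from collections import defaultdict


def balanced_batches(labels, batch_size, n_batches, seed=0):
    """Yield lists of dataset indices with equally many examples per class.

    Hypothetical helper for illustration only. Indices are drawn with
    replacement inside each class, so small classes are oversampled.
    """
    rng = random.Random(seed)

    # Group dataset indices by their label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    classes = sorted(by_class)
    if batch_size % len(classes) != 0:
        raise ValueError("batch_size must be divisible by the number of classes")
    per_class = batch_size // len(classes)

    for _ in range(n_batches):
        batch = []
        for label in classes:
            # Same number of draws for every class, regardless of its size.
            batch.extend(rng.choices(by_class[label], k=per_class))
        rng.shuffle(batch)  # avoid a fixed class order inside the batch
        yield batch


# The example from the issue: 1000 positives (label 1), 100 negatives (label 0).
labels = [1] * 1000 + [0] * 100
batches = list(balanced_batches(labels, batch_size=32, n_batches=5))
```

Each yielded batch here holds 16 positive and 16 negative indices. Because both classes receive the same number of draws, an individual negative is picked roughly 10 times as often as an individual positive.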
Someone said PyTorch has a way to do this:
Adding a sampler option for iterator as PyTorch can solve the problem I guess.
http://pytorch.org/docs/data.html#torch.utils.data.DataLoader
Source: https://chainer.slack.com/archives/C0LC5A6C9/p1497343348496751
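As a rough illustration of the sampler idea, the sketch below draws indices with probability inversely proportional to class frequency, using only the Python standard library. This is the same general idea as per-sample weights in PyTorch's `torch.utils.data.WeightedRandomSampler`, but the code does not use PyTorch or Chainer. The helper names `inverse_frequency_weights` and `weighted_indices` are assumptions made for this example.

```python
import random
from collections import Counter


def inverse_frequency_weights(labels):
    """Give every example the weight 1 / (size of its class)."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


def weighted_indices(labels, num_samples, seed=0):
    """Draw dataset indices with replacement, weighted by inverse class frequency.

    Hypothetical helper for illustration only. Batches are balanced
    in expectation, not exactly.
    """
    rng = random.Random(seed)
    weights = inverse_frequency_weights(labels)
    return rng.choices(range(len(labels)), weights=weights, k=num_samples)


# The example from the issue: 1000 positives (label 1), 100 negatives (label 0).
labels = [1] * 1000 + [0] * 100
weights = inverse_frequency_weights(labels)
indices = weighted_indices(labels, num_samples=10000)

# Fraction of drawn examples that are positive; close to one half on average.
positive_fraction = sum(labels[i] for i in indices) / len(indices)
```

Unlike fixed per-class quotas, this approach balances classes only on average, so an individual batch can be slightly off from an even split. In exchange it needs nothing beyond one weight per example.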
Maybe it’s possible for Chainer to have a wrapper for these PyTorch data preprocessing utilities. DataLoader is not involved in gradient computation, so wrapping it should be easy and take much less time than implementing equivalent functions from scratch.
Issue Analytics
- State:
- Created 6 years ago
- Reactions: 3
- Comments: 15 (7 by maintainers)
Top GitHub Comments
FYI, #3429 is now merged. It would be great if we could provide some sort of balanced sampler.
This issue is closed as announced. Feel free to re-open it if needed.