Improvements around `ReproducibleBatchSampler`
Feature
IMHO an attempt should be made not to wrap or change the data pipeline objects. Despite the name, this `ReproducibleBatchSampler` seems to be about making datasets resumable from the middle of an epoch by skipping the first few examples, rather than about reproducibility, since it doesn't appear to set any seeds on the samplers.
(If that is not the case, I think the point is even stronger. I'd prefer not to have seeds and "reproducibility" introduced into my data pipeline in the background. They are all good things, but not when it happens without my knowledge, or without an option to disable it.)
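For contrast, a minimal sketch of what opt-in reproducibility usually looks like, with an explicit user-controlled seed; the `dataset` name is hypothetical, and the `generator` argument assumes a reasonably recent torch version:

```python
import torch
from torch.utils.data import DataLoader, RandomSampler

# Opt-in reproducibility: the user supplies the seed explicitly;
# nothing is wrapped or re-seeded behind the scenes.
g = torch.Generator()
g.manual_seed(42)
sampler = RandomSampler(dataset, generator=g)  # `dataset` is assumed to exist
loader = DataLoader(dataset, sampler=sampler, batch_size=32)
```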
Silently wrapping objects or creating new instances can introduce unexpected issues, e.g.:
- The point brought up in #812, about side effects.
- `ReproducibleBatchSampler` seems to assume it's wrapping the pytorch `BatchSampler`. For instance, it assumes the `BatchSampler` has a `sampler` instance variable. This is the case for the pytorch class, but it is not required in general: a `BatchSampler` is just a sampler whose `__iter__` returns batches of indexes (a minimal counterexample follows this list).
- `ReproducibleBatchSampler` samples all the indexes first. If the dataset is large, holding all those ints is not necessarily a trivial amount of time or memory.
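To make the second point concrete, here is a hypothetical batch sampler (all names invented for illustration) that is perfectly valid to pass as `DataLoader(batch_sampler=...)` yet has no `sampler` instance variable:

```python
from torch.utils.data import Sampler

class EveryOtherBatchSampler(Sampler):
    """Hypothetical example: a valid batch sampler (__iter__ yields
    lists of indexes) that has no `.sampler` attribute to wrap."""

    def __init__(self, dataset_len, batch_size):
        self.dataset_len = dataset_len
        self.batch_size = batch_size

    def __iter__(self):
        batch = []
        for idx in range(0, self.dataset_len, 2):  # every other example
            batch.append(idx)
            if len(batch) == self.batch_size:
                yield batch
                batch = []
        if batch:  # final, possibly partial, batch
            yield batch

    def __len__(self):
        n_indexes = (self.dataset_len + 1) // 2
        return (n_indexes + self.batch_size - 1) // self.batch_size
```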
First and foremost, I believe this new behavior should be very prominently noted in the documentation, change notes, a warning in the logs, etc. I realized after the fact that there is a "note" about it on the engine's `run` method, but as a user upgrading from 0.2.1, I would have had no idea this was happening if my batch sampler implementation had not happened to be incompatible.
Perhaps the behavior should be changed to: if the data loader stack is not holding an instance of `ReproducibleBatchSampler`, the engine simply loads and does nothing with the batches that need to be skipped upon resume (see the sketch below). Users who want the `ReproducibleBatchSampler` behavior can use this class explicitly when constructing their `DataLoader`.
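A rough sketch of that proposed default, with `dataloader` and the resume offset as assumed names:

```python
# Proposed fallback when no ReproducibleBatchSampler is present:
# consume and discard the already-seen batches; nothing is wrapped.
data_iter = iter(dataloader)          # `dataloader` is assumed to exist
for _ in range(batches_to_skip):      # hypothetical resume offset
    next(data_iter)
# ... training then continues from `data_iter` ...
```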
Luckily, ignite is a small library and its code is very readable (awesome, very happy user, thanks!). Upon reading the code, I see that I can simply implement my own `ReproducibleBatchSampler` to bypass the three concerns pointed out above. For the case where batches are sampled iid with replacement from the data, it's sufficient to shorten the number of batches yielded by the first call to `__iter__` when the run is resumed.
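A minimal sketch of that idea, with all names hypothetical: because batches are iid with replacement, resuming only requires the first post-resume epoch to be shorter, and no indexes need to be materialized up front:

```python
import torch

class ResumableIIDBatchSampler:
    """Hypothetical sketch: batches are drawn iid with replacement, so
    resuming mid-epoch only means yielding fewer batches from the
    first __iter__ call after the resume."""

    def __init__(self, dataset_len, batch_size, batches_per_epoch):
        self.dataset_len = dataset_len
        self.batch_size = batch_size
        self.batches_per_epoch = batches_per_epoch
        self._skip = 0  # batches already consumed before the checkpoint

    def resume(self, iteration):
        self._skip = iteration % self.batches_per_epoch

    def __iter__(self):
        n_batches = self.batches_per_epoch - self._skip
        self._skip = 0  # only the first epoch after resume is shortened
        for _ in range(n_batches):
            # Indexes are drawn lazily; nothing is held for the whole epoch.
            yield torch.randint(self.dataset_len, (self.batch_size,)).tolist()

    def __len__(self):
        return self.batches_per_epoch
```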
So this is more of a suggestion about default behavior, which I found to be somewhat unexpected.
Thanks for your work on this library!
Issue Analytics
- Created: 3 years ago
- Reactions: 2
- Comments: 6 (3 by maintainers)
Top GitHub Comments
@vfdev-5 thanks for the response. Glad to see this is already being addressed. The solution in #895 seems good to me.
@amatsukawa we are very sorry for that! We'll update the library soon with a more stable v0.4.0 release.
I think what can temporarily be done is to convert the torch `DataLoader` into an iterator and specify `epoch_length` in the `run` call. Based on #714 (so this probably works only on the nightly release):
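A sketch of that workaround under those assumptions: `trainer` and `train_loader` are hypothetical names, and `epoch_length` on `Engine.run` is the API introduced by #714:

```python
def infinite_batches(loader):
    # Re-create the loader's iterator each pass; yields batches forever,
    # so ignite receives a plain iterator and has nothing to wrap.
    while True:
        yield from loader

# Epoch boundaries now come from epoch_length, not the DataLoader itself.
trainer.run(infinite_batches(train_loader), max_epochs=10,
            epoch_length=len(train_loader))
```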