RFC Unified way to process batches in the "MiniBatch" estimators
Currently we have 3 "MiniBatch" estimators: MiniBatchKMeans, MiniBatchDictionaryLearning and MiniBatchSparsePCA. The sparse PCA is essentially the same as dict learning with dict and code exchanged. A new one is also incoming.
Since MiniBatchKMeans and MiniBatchDictionaryLearning are being reworked (#17622 and #18975), and MiniBatchNMF is in development (#16948), I think it's a good moment to unify the way we process the minibatches, because they currently each implement it in a different way.
Here are the existing implementations with their pros and cons (a runnable sketch of both schemes follows the list):
- Sample a batch from the data at each iteration, with or without replacement (currently in MiniBatchKMeans):

      for i in range(n_iter):
          this_X = X[np.random.choice(n_samples, size=batch_size, replace=True)]  # or replace=False

  - pros: does not require the data to be shuffled; all batches have the same size.
  - cons: not cheap, since fancy indexing makes a copy of the batch; epochs are not clearly defined if batch_size does not divide n_samples; no guarantee that all samples are seen in each epoch.
- Pre-define the batches by generating contiguous slices and cycle through them (currently in MiniBatchDictionaryLearning):

      batches = gen_batches(n_samples, batch_size)   # sklearn.utils.gen_batches
      batches = itertools.cycle(batches)
      for i, batch in zip(range(n_iter), batches):
          this_X = X[batch]

  - pros: clear epochs (one per completed cycle); essentially free, since batches are views on the data.
  - cons: the last batch of each epoch can be smaller than batch_size (which can mess a bit with the stopping criterion); requires the data to be shuffled (hence the shuffle param); the data is not re-shuffled between epochs, so there is still some pattern in the batches.
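For concreteness, here is a minimal self-contained sketch of the two schemes side by side. This is illustrative code, not the estimators' actual implementation; the toy data, batch_size and n_iter values are made up:

```python
import itertools

import numpy as np
from sklearn.utils import gen_batches, shuffle

rng = np.random.RandomState(0)
X = rng.rand(1000, 5)            # toy data for illustration
n_samples = X.shape[0]
batch_size, n_iter = 32, 100

# Scheme 1: sample a fresh batch at every iteration (MiniBatchKMeans-style).
# Fancy indexing copies the batch, and nothing guarantees that every sample
# is visited during a given "epoch".
for i in range(n_iter):
    indices = rng.choice(n_samples, size=batch_size, replace=False)
    this_X = X[indices]          # copy

# Scheme 2: contiguous slices, cycled (MiniBatchDictionaryLearning-style).
# Batches are views and epochs are well defined, but the data must be
# shuffled up front and the last batch of each epoch may be smaller.
X_shuffled = shuffle(X, random_state=rng)
batches = itertools.cycle(gen_batches(n_samples, batch_size))
for i, batch in zip(range(n_iter), batches):
    this_X = X_shuffled[batch]   # view
```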
@scikit-learn/core-devs please give your opinion on your preferred solution. Maybe there's a commonly accepted way of doing this / a best practice? Feel free to propose alternatives as well! Also feel free to edit the pros and cons if I forgot something.
Top GitHub Comments
I don’t fully agree that the most interesting use case is partial_fit. Even if the data fits in memory, fitting the estimator in an online manner can converge much faster than its full-batch version. For instance, MiniBatchKMeans can reach an objective value similar to that of KMeans after only 1 or 2 epochs, whereas KMeans needs ~10 to ~100 iterations.
I find the first option unexpected as it potentially doesn't make use of all samples. My expectations would be somewhat closer to the second approach, but with a shuffle_inplace bool parameter which would decide whether or not we make a copy of the input before the first shuffle (I think we sometimes call that "copy", although it's not the best descriptive name IMHO).

Can you detail a bit on this? Shouldn't the criterion be normalized w.r.t. the number of samples?
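For illustration, the copy-versus-in-place idea could look roughly like the sketch below; shuffle_inplace is the hypothetical parameter discussed above, not an existing scikit-learn option:

```python
import numpy as np
from sklearn.utils import check_random_state, shuffle


def prepare_data(X, shuffle_inplace=False, random_state=None):
    """Shuffle X once before generating contiguous batches.

    `shuffle_inplace` is hypothetical: if True, permute the caller's array
    in place (no copy); if False, work on a shuffled copy.
    """
    rng = check_random_state(random_state)
    if shuffle_inplace:
        rng.shuffle(X)                       # in-place row permutation
        return X
    return shuffle(X, random_state=rng)      # shuffled copy
```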