RFC Unified way to process batches in the "MiniBatch" estimators
Currently we have 3 "MiniBatch" estimators: MiniBatchKMeans, MiniBatchDictionaryLearning and MiniBatchSparsePCA. The sparse PCA is essentially the same as dict learning with dict and code exchanged. A new one is also incoming.
Since MiniBatchKMeans and MiniBatchDictionaryLearning are being reworked (#17622 and #18975), and MiniBatchNMF is in development (#16948), I think it's a good moment to unify the way we process the minibatches, because they currently each implement it in a different way.
Here are the existing implementations with their pros and cons (a runnable sketch of both schemes follows the list):
- Sample a batch from the data at each iteration, with or without replacement (currently in MiniBatchKMeans):

      for i in range(n_iter):
          this_X = X[np.random.choice(n_samples, size=batch_size, replace=True)]  # or replace=False

  - pros: does not require the data to be shuffled; all batches have the same size.
  - cons: not cheap, since fancy indexing makes a copy of the batch; epochs are not clearly defined if batch_size does not divide n_samples; no guarantee that all samples are seen in each epoch.
- Pre-define the batches by generating contiguous slices and cycle through them (currently in MiniBatchDictionaryLearning):

      batches = gen_batches(n_samples, batch_size)   # sklearn.utils.gen_batches
      batches = itertools.cycle(batches)
      for i, batch in zip(range(n_iter), batches):
          this_X = X[batch]

  - pros: clear epochs (one per completed cycle); essentially free, since batches are views on the data.
  - cons: the last batch of each epoch can be smaller than batch_size (which can mess a bit with the stopping criterion); requires the data to be shuffled (hence the shuffle param); the data is not re-shuffled between epochs, so there is still some pattern in the batches.
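For concreteness, here is a minimal self-contained sketch of the two schemes side by side. This is illustrative code, not the estimators' actual implementation; the toy data, batch_size and n_iter values are made up:

```python
import itertools

import numpy as np
from sklearn.utils import gen_batches, shuffle

rng = np.random.RandomState(0)
X = rng.rand(1000, 5)            # toy data for illustration
n_samples = X.shape[0]
batch_size, n_iter = 32, 100

# Scheme 1: sample a fresh batch at every iteration (MiniBatchKMeans-style).
# Fancy indexing copies the batch, and nothing guarantees that every sample
# is visited during a given "epoch".
for i in range(n_iter):
    indices = rng.choice(n_samples, size=batch_size, replace=False)
    this_X = X[indices]          # copy

# Scheme 2: contiguous slices, cycled (MiniBatchDictionaryLearning-style).
# Batches are views and epochs are well defined, but the data must be
# shuffled up front and the last batch of each epoch may be smaller.
X_shuffled = shuffle(X, random_state=rng)
batches = itertools.cycle(gen_batches(n_samples, batch_size))
for i, batch in zip(range(n_iter), batches):
    this_X = X_shuffled[batch]   # view
```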
@scikit-learn/core-devs please give your opinion on your preferred solution. Maybe there's a commonly accepted way of doing this / a best practice? Feel free to propose alternatives as well! Also feel free to edit the pros and cons if I forgot something.
Top GitHub Comments
I don’t fully agree that the most interesting use case is partial_fit. Even if the data fits in memory, fitting the estimator in an online manner can converge much faster than its full-batch version. For instance, MiniBatchKMeans can reach an objective value similar to that of KMeans after only 1 or 2 epochs, whereas KMeans needs ~10 to ~100 iterations.
I find the first option unexpected as it potentially doesn't make use of all samples. My expectations would be somewhat closer to the second approach, but with a shuffle_inplace bool parameter which would decide whether or not we make a copy of the input before the first shuffle (I think we sometimes call that "copy", although it's not the best descriptive name IMHO).

Can you detail a bit on this? Shouldn't the criterion be normalized w.r.t. the number of samples?
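For illustration, the copy-versus-in-place idea could look roughly like the sketch below; shuffle_inplace is the hypothetical parameter discussed above, not an existing scikit-learn option:

```python
import numpy as np
from sklearn.utils import check_random_state, shuffle


def prepare_data(X, shuffle_inplace=False, random_state=None):
    """Shuffle X once before generating contiguous batches.

    `shuffle_inplace` is hypothetical: if True, permute the caller's array
    in place (no copy); if False, work on a shuffled copy.
    """
    rng = check_random_state(random_state)
    if shuffle_inplace:
        rng.shuffle(X)                       # in-place row permutation
        return X
    return shuffle(X, random_state=rng)      # shuffled copy
```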