KFold or RepeatedKFold with Incremental estimator
Hi,
Is there an easy way to do KFold or RepeatedKFold over an Incremental estimator (e.g. SGDRegressor)?
As I understand it, IncrementalSearchCV yields a set of hyperparameters optimized against a single, fixed validation set (controlled by test_size). What I would like is to run this optimization against different validation sets, as is typically done with KFold, something like:
GridSearchCV(Incremental(SGDRegressor(...)), params, cv=KFold(10))
Any help would be really appreciated, thanks!
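For concreteness, here is a minimal sketch of the pattern being asked about, assuming dask-ml's Incremental wrapper and scikit-learn's GridSearchCV; the hyperparameter grid is illustrative only, and whether this composition actually works is exactly the open question:

```python
# Hypothetical sketch of the desired workflow -- whether GridSearchCV composes
# with Incremental like this is what this issue is asking about.
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV, KFold
from dask_ml.wrappers import Incremental

# Parameters of the wrapped estimator are addressed via the estimator__ prefix.
params = {"estimator__alpha": [1e-4, 1e-3, 1e-2]}

search = GridSearchCV(
    Incremental(SGDRegressor(tol=1e-3)),
    params,
    cv=KFold(n_splits=10),
)
# search.fit(X, y)  # ideally, each of the 10 folds would be fit via partial_fit
```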
Issue Analytics
- Created 5 years ago
- Comments: 11 (6 by maintainers)
Top GitHub Comments
Your understanding is correct. Right now the strategy is to persist the test dataset once and use it for all the calls to partial_fit.
One thing I’m not sure about is how allowing multiple CV passes over the data would change the meaning of parameters like patience and max_iter. If you specify, say, cv=5, do you actually get 5 CV splits, or if max_iter is hit, do you stop?
Can I ask: what’s the motivation for multiple CV splits? Have you found it useful on large datasets in practice? See also https://github.com/dask/dask-ml/issues/303
In the past I’ve run into some issues with grid search and incremental. I haven’t dug into the details yet, though. I’ll put it on my todo list if no one beats me to it.