KFold or RepeatedKFold with Incremental estimator

Hi,

Is there an easy way to do KFold or RepeatedKFold over an Incremental estimator (e.g. SGDRegressor)?

As I understand, IncrementalSearchCV will yield a set of hyperparameters optimized on a single, fixed dataset (controlled by test_size). What I would like to do is have this optimization done on different validation datasets, like is typically done in KFold, something like:

GridSearchCV(Incremental(SGDRegressor(...)), params, cv=KFold(10))
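
For reference, here is a minimal runnable sketch of the pattern in the one-liner above, using scikit-learn only. The synthetic data and the alpha grid are assumptions made for illustration, and the dask-ml Incremental wrapper from the question is left out so the snippet runs without dask. With the wrapper in place, the grid keys would presumably need an estimator__ prefix (for example estimator__alpha).

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic regression data (assumption: stands in for the real dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=200)

# Hypothetical hyperparameter grid; the question does not say what `params` holds.
params = {"alpha": [1e-4, 1e-2, 1.0]}

search = GridSearchCV(
    SGDRegressor(max_iter=1000, tol=1e-3, random_state=0),
    params,
    cv=KFold(n_splits=10, shuffle=True, random_state=0),
)
search.fit(X, y)

# Each candidate is scored on 10 different validation folds and averaged.
print(search.best_params_)
print(search.cv_results_["mean_test_score"])
```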

Any help would be really appreciated, thanks!

Issue Analytics

  • State: open
  • Created: 5 years ago
  • Comments: 11 (6 by maintainers)

Top GitHub Comments

1 reaction
TomAugspurger commented, Oct 29, 2018

As I understand, IncrementalSearchCV will yield a set of hyperparameters optimized on a single, fixed dataset (controlled by test_size).

Your understanding is correct. Right now the strategy is to persist the test dataset once, and use it for all the calls to partial_fit.
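
The single-holdout strategy described above can be sketched roughly as follows. This is an illustration in plain scikit-learn and NumPy, not dask-ml's actual implementation: the data, the chunk count, and the candidate alpha values are all assumptions.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split

# Synthetic data (assumption: stands in for a dataset too large for one fit).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=1000)

# Split once; the same validation set is reused after every partial_fit call.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# One model per candidate hyperparameter value (hypothetical candidates).
candidates = {a: SGDRegressor(alpha=a, random_state=0) for a in (1e-4, 1e-2, 1.0)}
scores = {a: [] for a in candidates}

# Feed the training data in chunks and score on the fixed holdout each time.
chunks = zip(np.array_split(X_train, 10), np.array_split(y_train, 10))
for X_chunk, y_chunk in chunks:
    for alpha, model in candidates.items():
        model.partial_fit(X_chunk, y_chunk)
        scores[alpha].append(model.score(X_val, y_val))

# Pick the candidate with the best final score on the holdout.
best_alpha = max(scores, key=lambda a: scores[a][-1])
print(best_alpha)
```

Because every candidate is judged against the same 20% holdout, the chosen hyperparameters depend on that one split, which is the limitation the question raises.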

One thing I’m not sure about is how allowing multiple CV passes over the data would change the meaning of parameters like patience and max_iter. If you specify, say, cv=5, do you actually get 5 CV splits, or do you stop once max_iter is hit?

Can I ask: what’s the motivation for multiple CV splits? Have you found it useful on large datasets in practice? See also https://github.com/dask/dask-ml/issues/303
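
To make the ambiguity concrete, here is a sketch of one possible interpretation, in which max_iter and patience apply independently inside every fold. This is an assumption for illustration only, not how dask-ml behaves: the helper fit_one_fold is hypothetical, and max_iter is taken here to mean passes over the fold's training data.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import KFold


def fit_one_fold(X_tr, y_tr, X_te, y_te, alpha, max_iter=20, patience=3, n_chunks=5):
    """Train with partial_fit on one fold; stop early after `patience` stale passes."""
    model = SGDRegressor(alpha=alpha, random_state=0)
    best, stale = -np.inf, 0
    for _ in range(max_iter):
        # One pass over this fold's training data, chunk by chunk.
        for X_c, y_c in zip(np.array_split(X_tr, n_chunks), np.array_split(y_tr, n_chunks)):
            model.partial_fit(X_c, y_c)
        score = model.score(X_te, y_te)
        if score > best:
            best, stale = score, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best


# Synthetic data (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=500)

# Every fold gets its own model and its own max_iter / patience budget.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = [
    fit_one_fold(X[tr], y[tr], X[te], y[te], alpha=1e-4) for tr, te in cv.split(X)
]
mean_score = float(np.mean(fold_scores))
print(fold_scores, mean_score)
```

Under this reading, cv=5 always yields 5 splits and the total training cost is up to five times max_iter passes. The alternative reading, a single max_iter budget shared across all folds, would need different bookkeeping.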

0 reactions
TomAugspurger commented, Nov 1, 2018

In the past I’ve run into some issues with grid search and Incremental. I haven’t dug into the details yet, though. I’ll put it on my to-do list if no one beats me to it.

On Thu, Nov 1, 2018 at 5:16 AM Team notifications@github.com wrote:

Yes, this will work, but it won’t work if you specify cv=KFold(5). At least I couldn’t make it work, but maybe I’m doing something wrong.

Hm… the updated gist https://gist.github.com/stsievert/0b8050ad5bb7c959d27f3f773793cdb9 works for me, or at least runs without errors. Do you think you could create a minimal working example of the failure you’re seeing? It’d be useful if there is a bug.

Oh, nvm I think I see why, without running the code. I’m using KFold from dask-ml, and you’re using the one from sklearn. Are you sure the one from sklearn doesn’t make you store pandas frames/numpy arrays in memory when splitting? Will try later and confirm.
