
Status update on Incremental and Grid Search

See original GitHub issue

Quick status update: I wanted to explore a workflow that would use scikit-learn as much as possible. We’d use scikit-learn for all the hyper-parameter optimization and the actual training. Dask would just provide the large arrays.

The end goal is something as close as possible to GridSearchCV(SGDClassifier()) trained on a larger-than-memory dataset.

import sklearn.linear_model
import sklearn.model_selection

X, y = load_dask_arrays()

clf = sklearn.linear_model.SGDClassifier()
gs = sklearn.model_selection.GridSearchCV(clf, param_grid), y)

First, we have to avoid passing a large Dask array to, since scikit-learn would convert it to a single in-memory ndarray. So we wrap the estimator with Incremental, which passes the array’s blocks to SGDClassifier.partial_fit:

import sklearn.linear_model
import sklearn.model_selection
import dask_ml.wrappers

X, y = load_dask_arrays()

clf = sklearn.linear_model.SGDClassifier()
inc = dask_ml.wrappers.Incremental(clf)
gs = sklearn.model_selection.GridSearchCV(inc, param_grid), y)
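For intuition, the blockwise training that Incremental performs can be sketched in plain Python. This is only an illustration, not Dask-ML's actual implementation: `MeanEstimator` and `incremental_fit` are toy names invented here, and plain nested lists stand in for a chunked Dask array.

```python
class MeanEstimator:
    """Toy estimator: incrementally tracks the running mean of y."""

    def __init__(self):
        self.n = 0
        self.total = 0.0

    def partial_fit(self, X_block, y_block):
        # Only this block is in memory; state accumulates across calls.
        self.n += len(y_block)
        self.total += sum(y_block)
        return self

    @property
    def mean_(self):
        return self.total / self.n


def incremental_fit(est, X_blocks, y_blocks):
    # Conceptually what the wrapper does: one partial_fit call per
    # block, in order, so the full dataset is never materialized.
    for X_block, y_block in zip(X_blocks, y_blocks):
        est.partial_fit(X_block, y_block)
    return est


# Stand-in for a Dask array: the data already split into blocks.
X_blocks = [[[0.0], [1.0]], [[2.0], [3.0]]]
y_blocks = [[0, 1], [1, 1]]

est = incremental_fit(MeanEstimator(), X_blocks, y_blocks)
print(est.mean_)  # → 0.75
```

The key property is that the estimator's state lives across `partial_fit` calls, which is exactly what makes SGDClassifier a good fit for this wrapper.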

At this point,, y) works fine (I’m currently debugging unexpectedly high memory usage, but let’s ignore that for now), but there’s a subtle issue with inc.score(X, y), which surfaces in GridSearchCV. By default, a pass-through scorer is used, which here is SGDClassifier.score. That ends up calling sklearn.metrics.accuracy_score, which converts the test Dask arrays into large ndarrays on a single worker. This is captured in #200.
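To see why a blockwise scorer matters, here is a small pure-Python sketch (no Dask involved; both function names are invented for illustration). The naive scorer gathers every block before comparing, which is the analogue of the step that materializes large ndarrays on one worker, while the blockwise scorer only ever holds small per-block counts.

```python
def naive_accuracy(y_true_blocks, y_pred_blocks):
    # Analogue of the pass-through scorer: concatenate everything
    # first, then compare. With Dask arrays this is the step that
    # pulls the full test set onto a single worker.
    y_true = [y for block in y_true_blocks for y in block]
    y_pred = [y for block in y_pred_blocks for y in block]
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def blockwise_accuracy(y_true_blocks, y_pred_blocks):
    # Analogue of a Dask-friendly metric: reduce each block to a
    # (correct, total) pair, then combine the small pairs.
    correct = total = 0
    for t_block, p_block in zip(y_true_blocks, y_pred_blocks):
        correct += sum(t == p for t, p in zip(t_block, p_block))
        total += len(t_block)
    return correct / total


y_true = [[0, 1], [1, 0]]
y_pred = [[0, 1], [0, 0]]
assert naive_accuracy(y_true, y_pred) == blockwise_accuracy(y_true, y_pred) == 0.75
```

Both scorers agree on the result; they differ only in how much data has to be in one place at once, which is the whole point of the workaround below.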

A workaround to the scoring issue is to manually pass a scorer that works with Dask arrays. We’ve implemented a few in Dask-ML:

import sklearn.linear_model
import sklearn.model_selection
import dask_ml.wrappers
import dask_ml.metrics
from sklearn.metrics import make_scorer

X, y = load_dask_arrays()
scorer = make_scorer(dask_ml.metrics.accuracy_score)

clf = sklearn.linear_model.SGDClassifier()
inc = dask_ml.wrappers.Incremental(clf, scoring=scorer)
gs = sklearn.model_selection.GridSearchCV(inc, param_grid), y)

This gets us serial hyper-parameter optimization on larger-than-memory Dask arrays. To do things in parallel, we can use the distributed backend.

import dask_ml.joblib
from sklearn.externals import joblib

with joblib.parallel_backend("dask"):, y)

I’ll post benchmarks later when I’ve run them.

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 18 (18 by maintainers)

Top GitHub Comments

TomAugspurger commented, Jun 8, 2018

TomAugspurger commented, Jun 29, 2018: solves the ordering issue.


