
Status update on Incremental and Grid Search

See original GitHub issue

Quick status update: I wanted to explore a workflow that would use scikit-learn as much as possible. We’d use scikit-learn for all the hyper-parameter optimization and the actual training. Dask would just provide the large arrays.

The end goal is something as close as possible to GridSearchCV(SGDClassifier()) trained on a larger-than-memory dataset.

import sklearn.linear_model
import sklearn.model_selection

X, y = load_dask_arrays()

clf = sklearn.linear_model.SGDClassifier()
gs = sklearn.model_selection.GridSearchCV(clf, param_grid), y)

First, we have to avoid passing a large Dask array to, since scikit-learn would convert it to a single in-memory ndarray. So we wrap the estimator with Incremental, which passes the array’s blocks to SGDClassifier.partial_fit:

import sklearn.linear_model
import sklearn.model_selection
import dask_ml.wrappers

X, y = load_dask_arrays()

clf = sklearn.linear_model.SGDClassifier()
inc = dask_ml.wrappers.Incremental(clf)
gs = sklearn.model_selection.GridSearchCV(inc, param_grid), y)
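For intuition, the blockwise training that Incremental performs can be sketched in plain Python. This is only an illustration, not Dask-ML's actual implementation: `MeanEstimator` and `incremental_fit` are toy names invented here, and plain nested lists stand in for a chunked Dask array.

```python
class MeanEstimator:
    """Toy estimator: incrementally tracks the running mean of y."""

    def __init__(self):
        self.n = 0
        self.total = 0.0

    def partial_fit(self, X_block, y_block):
        # Only this block is in memory; state accumulates across calls.
        self.n += len(y_block)
        self.total += sum(y_block)
        return self

    @property
    def mean_(self):
        return self.total / self.n


def incremental_fit(est, X_blocks, y_blocks):
    # Conceptually what the wrapper does: one partial_fit call per
    # block, in order, so the full dataset is never materialized.
    for X_block, y_block in zip(X_blocks, y_blocks):
        est.partial_fit(X_block, y_block)
    return est


# Stand-in for a Dask array: the data already split into blocks.
X_blocks = [[[0.0], [1.0]], [[2.0], [3.0]]]
y_blocks = [[0, 1], [1, 1]]

est = incremental_fit(MeanEstimator(), X_blocks, y_blocks)
print(est.mean_)  # → 0.75
```

The key property is that the estimator's state lives across `partial_fit` calls, which is exactly what makes SGDClassifier a good fit for this wrapper.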

At this point,, y) works fine (I’m currently debugging unexpectedly high memory usage, but let’s ignore that for now), but there’s a subtle issue with inc.score(X, y), which surfaces in GridSearchCV. By default, a pass-through scorer is used, which here is SGDClassifier.score. That ends up calling sklearn.metrics.accuracy_score, which converts the test Dask arrays into large ndarrays on a single worker. This is captured in #200.
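To see why a blockwise scorer matters, here is a small pure-Python sketch (no Dask involved; both function names are invented for illustration). The naive scorer gathers every block before comparing, which is the analogue of the step that materializes large ndarrays on one worker, while the blockwise scorer only ever holds small per-block counts.

```python
def naive_accuracy(y_true_blocks, y_pred_blocks):
    # Analogue of the pass-through scorer: concatenate everything
    # first, then compare. With Dask arrays this is the step that
    # pulls the full test set onto a single worker.
    y_true = [y for block in y_true_blocks for y in block]
    y_pred = [y for block in y_pred_blocks for y in block]
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def blockwise_accuracy(y_true_blocks, y_pred_blocks):
    # Analogue of a Dask-friendly metric: reduce each block to a
    # (correct, total) pair, then combine the small pairs.
    correct = total = 0
    for t_block, p_block in zip(y_true_blocks, y_pred_blocks):
        correct += sum(t == p for t, p in zip(t_block, p_block))
        total += len(t_block)
    return correct / total


y_true = [[0, 1], [1, 0]]
y_pred = [[0, 1], [0, 0]]
assert naive_accuracy(y_true, y_pred) == blockwise_accuracy(y_true, y_pred) == 0.75
```

Both scorers agree on the result; they differ only in how much data has to be in one place at once, which is the whole point of the workaround below.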

A workaround to the scoring issue is to manually pass a scorer that works with Dask arrays. We’ve implemented a few in Dask-ML:

import sklearn.linear_model
import sklearn.model_selection
import dask_ml.wrappers
import dask_ml.metrics
from sklearn.metrics import make_scorer

X, y = load_dask_arrays()
scorer = make_scorer(dask_ml.metrics.accuracy_score)

clf = sklearn.linear_model.SGDClassifier()
inc = dask_ml.wrappers.Incremental(clf, scoring=scorer)
gs = sklearn.model_selection.GridSearchCV(inc, param_grid), y)

This gets us serial hyper-parameter optimization on larger-than-memory Dask arrays. To do things in parallel, we can use the distributed backend.

import dask_ml.joblib
from sklearn.externals import joblib

with joblib.parallel_backend("dask"):, y)

I’ll post benchmarks later when I’ve run them.

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 18 (18 by maintainers)

Top GitHub Comments

TomAugspurger commented, Jun 8, 2018

TomAugspurger commented, Jun 29, 2018: solves the ordering issue.


