Hyperparameter optimization on larger-than-memory datasets with Dask linear models
Hello all,
I’ve been trying to combine dask-ml’s tools in the most vanilla way that I can think of and they don’t seem to fit together (no pun intended).
Specifically, I want to both train models and optimize hyperparams on larger-than-memory datasets.
Initially I supposed I could just stick a linear model in a CV class. From the documentation, however, it seems that GridSearchCV and RandomizedSearchCV (all classes from dask_ml.model_selection) require the CV splits to fit in memory, while IncrementalSearchCV, HyperbandSearchCV, and SuccessiveHalvingSearchCV require the estimator to implement partial_fit. Since none of the linear models in the dask-ml API support partial_fit, I’m left wondering whether there’s a way to build a pure dask-ml ML workflow.
Something like:
import numpy as np
from dask.distributed import Client, LocalCluster
from dask import array as da, dataframe as ddf
from dask_ml.model_selection import RandomizedSearchCV, train_test_split
from dask_ml.linear_model import LogisticRegression
from dask_ml.datasets import make_classification
cluster = LocalCluster(  # Ignore specific values, just an example
    n_workers=4,
    threads_per_worker=2,
    memory_limit="1024MB",
    dashboard_address="0.0.0.0:1234",
)
client = Client(cluster)

X, y = make_classification(
    n_samples=1_000_000,
    n_features=12,
    n_informative=3,
    n_redundant=1,
    n_classes=2,
    chunks=5000,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

L = (10 ** np.linspace(-5, 2, num=10)).tolist()

rscv = RandomizedSearchCV(
    LogisticRegression(),
    param_distributions={"C": L},
    n_iter=20,
    scheduler=client,
    cache_cv=False,
)
rscv.fit(X_train, y_train)
rscv.score(X_test, y_test)
(In case anyone’s wondering, the script above gives me an out-of-memory error and starts killing all the workers, and I can’t figure out why.)
Thanks for any help! Cheers
Issue Analytics
- State:
- Created 4 years ago
- Comments: 5 (4 by maintainers)
Top GitHub Comments
I think your expectation that dask_ml.model_selection.GridSearchCV should work well with a model accepting a dask Array is reasonable. I vaguely recall looking into this a while back, but I don’t remember the outcome. If you’re interested in investigating, it’d be nice to know which operations are causing the workers to error. Otherwise I’ll be able to look into it in a week or two.
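In the meantime, one workable pattern (a sketch, not a fix for the search classes themselves) is to stream a larger-than-memory dask array chunk-by-chunk through a scikit-learn estimator that does implement partial_fit, such as SGDClassifier. The data sizes, chunking, and hyperparameters below are illustrative assumptions, and the in-memory arrays only stand in for data that would normally be loaded lazily:

```python
# Sketch: incremental training over dask array chunks with an
# estimator that supports partial_fit (scikit-learn's SGDClassifier).
import numpy as np
import dask.array as da
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
X_np = rng.normal(size=(10_000, 12))
# A linearly separable target, just for demonstration.
y_np = (X_np[:, 0] + 0.5 * X_np[:, 1] > 0).astype(int)

# In a real workflow X and y would come from dask.dataframe or
# dask_ml.datasets; here we wrap in-memory data to show the loop.
X = da.from_array(X_np, chunks=(1_000, 12))
y = da.from_array(y_np, chunks=(1_000,))

clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])

# Pull one block at a time into memory and update the model, so the
# full dataset never has to be materialized at once.
for i in range(X.numblocks[0]):
    Xb = X.blocks[i].compute()
    yb = y.blocks[i].compute()
    clf.partial_fit(Xb, yb, classes=classes)

print(clf.score(X_np, y_np))
```

Because SGDClassifier implements partial_fit, the same estimator should also be usable with IncrementalSearchCV or HyperbandSearchCV for the hyperparameter-search half of the workflow.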
I also think that Dask-ML’s linear models should implement partial_fit, which would let them work well with e.g. Hyperband. Yeah, I’d imagine so, because most optimization routines are iterative. In practice, I’d imagine this would boil down to an implementation of warm_start, or sending an initial estimate (or beta vector) to dask_glm/algorithms.py (e.g., gradient_descent(..., initial_model=previous_beta)).
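The warm-start idea can be sketched with scikit-learn's LogisticRegression(warm_start=True), which reuses the previous coef_ as the starting point for the next call to fit; note that gradient_descent(..., initial_model=...) above is a hypothetical dask-glm signature, not an existing API, and this stand-in only illustrates the optimization pattern:

```python
# Sketch: successive fits that resume from the previous coefficient
# estimate instead of restarting from zeros.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(2_000, 5))
y = (X @ np.array([1.0, -2.0, 0.5, 0.0, 0.0]) > 0).astype(int)

# A deliberately rough first fit: only a few solver iterations,
# so coef_ is a coarse initial estimate.
clf = LogisticRegression(warm_start=True, max_iter=5)
clf.fit(X, y)

# The second fit starts from the previous coef_ (warm start), so it
# typically needs fewer additional iterations to converge than a
# cold start would.
clf.set_params(max_iter=100)
clf.fit(X, y)
print(clf.n_iter_[0])
```

This is exactly the hook that partial_fit (or an initial_model argument) would expose for Hyperband-style searches, which repeatedly resume training on promising configurations.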