Hyperparameter optimization on LTM datasets with Dask linear models

Hello all,

I’ve been trying to combine dask-ml’s tools in the most vanilla way that I can think of and they don’t seem to fit together (no pun intended).

Specifically, I want to both train models and optimize hyperparams on larger-than-memory datasets.

Initially I assumed I could just drop a linear model into a CV search class. From the documentation, however, it seems that GridSearchCV and RandomizedSearchCV (all classes here are from dask_ml.model_selection) both require that the CV splits fit in memory, while IncrementalSearchCV, HyperbandSearchCV, and SuccessiveHalvingSearchCV require that the estimator implement partial_fit. Since none of the linear models in the dask-ml API support partial_fit, I’m left wondering whether there’s a way to run this workflow with pure dask-ml.

Something like:

import numpy as np
from dask.distributed import Client, LocalCluster

from dask_ml.model_selection import RandomizedSearchCV, train_test_split
from dask_ml.linear_model import LogisticRegression
from dask_ml.datasets import make_classification

cluster = LocalCluster( # Ignore specific values, just an example
    n_workers=4,
    threads_per_worker=2,
    memory_limit="1024MB",
    dashboard_address="0.0.0.0:1234",
)

client = Client(cluster)

X, y = make_classification(
    n_samples=1_000_000,
    n_features=12,
    n_informative=3,
    n_redundant=1,
    n_classes=2,
    chunks=5000,
)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Log-spaced grid of inverse regularization strengths for LogisticRegression
L = (10 ** np.linspace(-5, 2, num=10)).tolist()

rscv = RandomizedSearchCV(
    LogisticRegression(),
    param_distributions={"C": L},
    n_iter=20,
    scheduler=client,
    cache_cv=False,
)

rscv.fit(X_train, y_train)

rscv.score(X_test, y_test)

(In case anyone’s wondering: the script above hits an out-of-memory error, the workers start getting killed one after another, and I can’t figure out why; a diagnostic sketch for narrowing this down follows after this message.)

Thanks for any help! Cheers
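
A minimal diagnostic sketch for the crash above, assuming a reasonably recent dask.distributed (one where performance_report is available): wrap the failing fit in a performance report and watch the memory pane of the dashboard configured in the script (0.0.0.0:1234). The report filename is arbitrary.

from dask.distributed import performance_report

# Record scheduler and worker activity during the failing call; the
# resulting HTML report shows per-task time and memory, which helps
# pinpoint the operation that pushes workers past the 1024MB limit.
with performance_report(filename="rscv-fit-report.html"):
    rscv.fit(X_train, y_train)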

Issue Analytics

  • State: open
  • Created: 4 years ago
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

1 reaction
TomAugspurger commented, Oct 8, 2019

I think your expectation that dask_ml.model_selection.GridSearchCV should work well with a model accepting a dask Array is reasonable. I vaguely recall looking into this a while back but I don’t remember the outcome.

If you’re interested in investigating, it’d be nice to know which operations are causing the workers to error. Otherwise I’ll be able to look into it in a week or two.

I also think that Dask-ML’s linear models should implement partial_fit, which would let them work well with e.g. Hyperband.
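
To make that concrete: until the dask-ml linear models grow partial_fit, the same pattern already works with a scikit-learn estimator that implements it. A minimal sketch, reusing X_train and y_train from the script above, with SGDClassifier (a linear model that does have partial_fit) standing in for dask-ml’s LogisticRegression; the parameter grid is illustrative:

import numpy as np
from sklearn.linear_model import SGDClassifier
from dask_ml.model_selection import HyperbandSearchCV

# SGDClassifier implements partial_fit, so Hyperband can feed it one
# chunk of the dask arrays at a time instead of materializing everything.
params = {"alpha": (10 ** np.linspace(-5, 2, num=10)).tolist()}
search = HyperbandSearchCV(SGDClassifier(), params, max_iter=27)
# classes must reach the first partial_fit call, so pass it through fit
search.fit(X_train, y_train, classes=[0, 1])
print(search.best_params_)

Hyperband then gets its early stopping for free: configurations that look bad after a few chunks are dropped before they ever see the full dataset.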

0 reactions
stsievert commented, Oct 8, 2019

“It seems pretty easy to add, right?”

Yeah, I’d imagine so, because most of these optimization routines are iterative. In practice, I’d imagine this boils down to implementing warm_start, or to sending an initial estimate (a beta vector) into dask_glm/algorithms.py (e.g., gradient_descent(..., initial_model=previous_beta)).
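
For illustration only, here is a toy version of that warm-start idea: a logistic regression whose partial_fit resumes gradient descent from the previous coefficient vector rather than from scratch. The class and its parameters are hypothetical plain-NumPy stand-ins, not dask-glm’s actual API:

import numpy as np

class WarmStartLogistic:
    # Hypothetical sketch: each partial_fit call runs a few gradient
    # steps starting from the coefficients left by the previous call.
    def __init__(self, lr=0.1, steps=5):
        self.lr = lr
        self.steps = steps
        self.coef_ = None

    def partial_fit(self, X, y, classes=None):
        if self.coef_ is None:
            self.coef_ = np.zeros(X.shape[1])
        for _ in range(self.steps):
            p = 1.0 / (1.0 + np.exp(-X @ self.coef_))  # sigmoid
            grad = X.T @ (p - y) / len(y)              # logistic-loss gradient
            self.coef_ -= self.lr * grad
        return self

    def predict(self, X):
        return (X @ self.coef_ > 0).astype(int)

Because state persists across calls, an adaptive search can train such a model chunk by chunk and stop unpromising hyperparameter settings early, which is exactly the hook Hyperband needs.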
