Hyperparameter optimization on larger-than-memory datasets with Dask linear models
Hello all,
I’ve been trying to combine dask-ml’s tools in the most vanilla way that I can think of and they don’t seem to fit together (no pun intended).
Specifically, I want to both train models and optimize hyperparams on larger-than-memory datasets.
Initially I supposed I could just stick a linear model in a CV class. From the documentation, however, it seems that GridSearchCV and RandomizedSearchCV (all classes from dask_ml.model_selection) require the CV splits to fit in memory, while IncrementalSearchCV, HyperbandSearchCV, and SuccessiveHalvingSearchCV require the estimator to implement partial_fit. Since none of the linear models in the dask-ml API support partial_fit, I’m left wondering whether there’s a way to build a pure dask-ml ML workflow.
Something like:
import numpy as np
from dask.distributed import Client, LocalCluster
from dask import array as da, dataframe as ddf
from dask_ml.model_selection import RandomizedSearchCV, train_test_split
from dask_ml.linear_model import LogisticRegression
from dask_ml.datasets import make_classification
cluster = LocalCluster(  # Ignore specific values, just an example
    n_workers=4,
    threads_per_worker=2,
    memory_limit="1024MB",
    dashboard_address="0.0.0.0:1234",
)
client = Client(cluster)

X, y = make_classification(
    n_samples=1_000_000,
    n_features=12,
    n_informative=3,
    n_redundant=1,
    n_classes=2,
    chunks=5000,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

L = (10 ** np.linspace(-5, 2, num=10)).tolist()

rscv = RandomizedSearchCV(
    LogisticRegression(),
    param_distributions={"C": L},
    n_iter=20,
    scheduler=client,
    cache_cv=False,
)
rscv.fit(X_train, y_train)
rscv.score(X_test, y_test)
(In case anyone’s wondering, the script above gives me an out-of-memory error and starts killing all the workers, and I can’t figure out why.)
Thanks for any help! Cheers
Issue Analytics
- State:
- Created 4 years ago
- Comments: 5 (4 by maintainers)
Top GitHub Comments
I think your expectation that dask_ml.model_selection.GridSearchCV should work well with a model accepting a dask Array is reasonable. I vaguely recall looking into this a while back, but I don’t remember the outcome. If you’re interested in investigating, it’d be nice to know which operations are causing the workers to error. Otherwise I’ll be able to look into it in a week or two.
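In the meantime, one workable pattern (a sketch, not a fix for the search classes themselves) is to stream a larger-than-memory dask array chunk-by-chunk through a scikit-learn estimator that does implement partial_fit, such as SGDClassifier. The data sizes, chunking, and hyperparameters below are illustrative assumptions, and the in-memory arrays only stand in for data that would normally be loaded lazily:

```python
# Sketch: incremental training over dask array chunks with an
# estimator that supports partial_fit (scikit-learn's SGDClassifier).
import numpy as np
import dask.array as da
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
X_np = rng.normal(size=(10_000, 12))
# A linearly separable target, just for demonstration.
y_np = (X_np[:, 0] + 0.5 * X_np[:, 1] > 0).astype(int)

# In a real workflow X and y would come from dask.dataframe or
# dask_ml.datasets; here we wrap in-memory data to show the loop.
X = da.from_array(X_np, chunks=(1_000, 12))
y = da.from_array(y_np, chunks=(1_000,))

clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])

# Pull one block at a time into memory and update the model, so the
# full dataset never has to be materialized at once.
for i in range(X.numblocks[0]):
    Xb = X.blocks[i].compute()
    yb = y.blocks[i].compute()
    clf.partial_fit(Xb, yb, classes=classes)

print(clf.score(X_np, y_np))
```

Because SGDClassifier implements partial_fit, the same estimator should also be usable with IncrementalSearchCV or HyperbandSearchCV for the hyperparameter-search half of the workflow.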
I also think that Dask-ML’s linear models should implement partial_fit, which would let them work well with e.g. Hyperband. Yeah, I’d imagine so, because most optimization routines are iterative. In practice, I’d imagine this would boil down to an implementation of warm_start, or sending an initial estimate (or beta vector) to dask_glm/algorithms.py (e.g., gradient_descent(..., initial_model=previous_beta)).
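The warm-start idea can be sketched with scikit-learn's LogisticRegression(warm_start=True), which reuses the previous coef_ as the starting point for the next call to fit; note that gradient_descent(..., initial_model=...) above is a hypothetical dask-glm signature, not an existing API, and this stand-in only illustrates the optimization pattern:

```python
# Sketch: successive fits that resume from the previous coefficient
# estimate instead of restarting from zeros.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(2_000, 5))
y = (X @ np.array([1.0, -2.0, 0.5, 0.0, 0.0]) > 0).astype(int)

# A deliberately rough first fit: only a few solver iterations,
# so coef_ is a coarse initial estimate.
clf = LogisticRegression(warm_start=True, max_iter=5)
clf.fit(X, y)

# The second fit starts from the previous coef_ (warm start), so it
# typically needs fewer additional iterations to converge than a
# cold start would.
clf.set_params(max_iter=100)
clf.fit(X, y)
print(clf.n_iter_[0])
```

This is exactly the hook that partial_fit (or an initial_model argument) would expose for Hyperband-style searches, which repeatedly resume training on promising configurations.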