
Hyperparameter optimization benchmarking

See original GitHub issue

It’d be nice to have some benchmarks for how our different hyperparameter optimizers perform. There are a few comparisons that would be useful:

  1. dask_ml’s drop-in replacements for GridSearchCV and RandomizedSearchCV. We’re able to deconstruct Pipeline objects to avoid redundant fit calls. This benchmark would compare the same GridSearchCV(Pipeline(...)) run with dask_ml.model_selection.GridSearchCV and with sklearn.model_selection.GridSearchCV; see the first sketch after this list. We’d expect Dask-ML’s to perform better the more CV splits there are and the more parameters that are explored early on in the pipeline (https://github.com/dask/dask-ml/issues/141 has some discussion).
  2. Scaling of Dask’s joblib backend for large problems. Internally, scikit-learn uses joblib for parallel for loops. With
with joblib.parallel_backend("dask"):
    ...

the items in the for loop are executed on the Dask cluster. There are some issues with the backend (https://github.com/joblib/joblib/issues/1020, https://github.com/joblib/joblib/issues/1025). Fixing those isn’t in scope for this work, but we’d like to have benchmarks to understand the current performance and to measure the speedup from fixing them; see the second sketch after this list.
  3. General performance on large datasets with Incremental, Hyperband, etc. We can’t really compare to scikit-learn here, since it doesn’t handle larger-than-memory datasets; the third sketch after this list shows the shape of such a benchmark. @stsievert may have some thoughts / benchmarks to share here.
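
A minimal sketch of the first comparison, timing the same pipeline grid search under both implementations. The pipeline (SelectKBest into SVC), the grid, and the dataset size are illustrative placeholders, not from the issue:

from time import perf_counter

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
import dask_ml.model_selection
import sklearn.model_selection

X, y = make_classification(n_samples=2_000, n_features=40, random_state=0)

# a cheap early step followed by an expensive estimator: Dask-ML should
# avoid refitting SelectKBest for every SVC parameter combination
pipeline = Pipeline([("select", SelectKBest()), ("svc", SVC())])
grid = {
    "select__k": [5, 10, 20],      # varied early in the pipeline
    "svc__C": [0.1, 1.0, 10.0],
}

for module in [sklearn.model_selection, dask_ml.model_selection]:
    search = module.GridSearchCV(pipeline, grid, cv=5)
    start = perf_counter()
    search.fit(X, y)
    print(f"{module.__name__}: {perf_counter() - start:.1f}s")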
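
A minimal sketch of the second benchmark, assuming a local cluster stands in for a real deployment; the model and data sizes are placeholders:

import joblib
from dask.distributed import Client
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

client = Client(n_workers=4)  # the "dask" joblib backend uses the current Client

X, y = make_classification(n_samples=10_000, n_features=50, random_state=0)
model = RandomForestClassifier(n_estimators=200)

# scikit-learn's internal joblib for loop (one task per CV split here)
# now runs on the Dask cluster instead of in local processes
with joblib.parallel_backend("dask"):
    scores = cross_val_score(model, X, y, cv=5, n_jobs=-1)
print(scores.mean())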
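
And a minimal sketch of the third benchmark with HyperbandSearchCV, where chunked dask arrays stand in for a larger-than-memory dataset; the chunk size and parameter ranges are placeholders:

import numpy as np
from dask.distributed import Client
from dask_ml.datasets import make_classification
from dask_ml.model_selection import HyperbandSearchCV
from sklearn.linear_model import SGDClassifier

client = Client()

# chunked dask arrays; on a real cluster these can exceed one machine's memory
X, y = make_classification(n_samples=100_000, n_features=20,
                           chunks=10_000, random_state=0)

# Hyperband requires an estimator that implements partial_fit
params = {"alpha": list(np.logspace(-4, 0, 10)), "average": [True, False]}
search = HyperbandSearchCV(SGDClassifier(tol=1e-3), params, max_iter=27)
search.fit(X, y, classes=[0, 1])
print(search.best_params_, search.best_score_)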

cc @dankerrigan. This is more than enough work, I think. If you’re able to make progress on any of these (or other things you think are important), it’d be great.

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 15 (11 by maintainers)

Top GitHub Comments

8 reactions
mrocklin commented, Apr 22, 2020

I think that I’m the westernmost person who would be interested in this. My day starts around 14:30 UTC (7:30 US Pacific, 10:30 US Eastern). I suggest that if people are interested they click the Heart icon on this comment. I’ll then send out an e-mail with some scheduling options.

2 reactions
pierreglaser commented, Apr 22, 2020

FYI, I’m working a lot on improving the joblib/dask integration these days. Among other things, I’m building a benchmark suite for joblib with the dask backend across a variety of workloads and use cases, including scikit-learn cross-validation, GridSearch, etc. So I’m very interested in this.

Read more comments on GitHub >

