Parallelizing cross validation
I have a nested loop that I’m trying to parallelize. It looks like this:
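import dask
from dask import delayed
from sklearn.model_selection import KFold

def est_model(data, params):
    ...  # fit a model on data (body omitted in the original post)
    return model

def score_data(data, model):
    ...  # score the model on data (body omitted in the original post)
    return score

scores_by_dataset = []
for dataset in Datasets:
    raw_obj = BigData[dataset]  ## WANT TO WRAP THIS IN DELAYED
    processed_obj = raw_obj.process()  ## TO DELAY THIS
    X = list(range(processed_obj.n_chunks))
    scores = []
    kf = KFold(n_splits=5)
    for training, validation in kf.split(X):
        train = processed_obj[training]
        test = processed_obj[validation]
        # traverse=False stops dask from searching inside the object for nested delayed values
        model = delayed(est_model)(delayed(train, traverse=False), params)
        scores.append(delayed(score_data)(delayed(test, traverse=False), model))
    scores_by_dataset.append(scores)
sbs = dask.compute(*scores_by_dataset)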
It currently works fine, but I’d like to further parallelize the initial raw data processing. The problem is that knowing the number of chunks requires this pre-processing step. As far as I can see, this will mean that there is a for loop which depends on a delayed object. Is there a pattern for cross-validation that I’m just missing?
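One possible way around the dependency, sketched here under assumptions rather than taken from the thread, is to split the work into two phases: first run the preprocessing for all datasets in parallel and materialize the results, then build the fold graph once the chunk counts are concrete. The toy `process`, `est_model`, `score_data`, and `kfold_indices` functions below are hypothetical stand-ins (plain lists instead of `BigData` objects, and a hand-rolled contiguous split instead of `KFold`) so the example runs with only dask installed. The trade-off is a barrier between the two phases, and the processed objects are pulled back to the client in between.

```python
import dask
from dask import delayed

# Hypothetical stand-ins for the BigData objects and model functions in the post.
def process(raw):
    """Pretend preprocessing: split a list into chunks of two items."""
    return [raw[i:i + 2] for i in range(0, len(raw), 2)]

def est_model(train_chunks):
    """Toy 'model': the mean of all training values."""
    values = [v for chunk in train_chunks for v in chunk]
    return sum(values) / len(values)

def score_data(test_chunks, model):
    """Toy score: mean absolute deviation of the test values from the model."""
    values = [v for chunk in test_chunks for v in chunk]
    return sum(abs(v - model) for v in values) / len(values)

def kfold_indices(n, n_splits):
    """Contiguous k-fold split over range(n); a simple stand-in for KFold.split."""
    sizes = [n // n_splits + (1 if i < n % n_splits else 0) for i in range(n_splits)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size

raw_datasets = {"a": list(range(10)), "b": list(range(20, 32))}

# Phase 1: preprocess every dataset in parallel and materialize the results.
processed = dask.compute(*[delayed(process)(raw) for raw in raw_datasets.values()])

# Phase 2: chunk counts are now known, so the fold loop can be built as before.
scores_by_dataset = []
for chunks in processed:
    scores = []
    for train_idx, test_idx in kfold_indices(len(chunks), n_splits=5):
        train = [chunks[i] for i in train_idx]
        test = [chunks[i] for i in test_idx]
        model = delayed(est_model)(delayed(train, traverse=False))
        scores.append(delayed(score_data)(delayed(test, traverse=False), model))
    scores_by_dataset.append(scores)

# One list of five fold scores per dataset.
results = dask.compute(*scores_by_dataset)
```

If the processed objects are too large to bring back to the client, this sketch would not be suitable as written; keeping the results on the workers between phases would need a different mechanism.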
Issue Analytics
- State:
- Created 6 years ago
- Comments:16 (10 by maintainers)
Top GitHub Comments
https://github.com/dask/dask-ml/compare/master...TomAugspurger:train-test-split?expand=1 has train_test_split mostly finished, I think. I don’t recall why I didn’t make a PR with it yet.
Haven’t had time to work on the rest, so if anyone is interested in finishing off that PR or picking up the follow-on things like cross_val_score and the various CV strategies, I’d appreciate it.
Thanks @kdubovikov
I’m hoping to implement some CV strategies (and utilities like cross_val_score) in dask-ml this week. I’ll update this issue when that’s done.