
Parallelizing cross validation


I have a nested loop that I’m trying to parallelize. It looks like this:

import dask
from dask import delayed
from sklearn.model_selection import KFold

def est_model(data, params):
    return model

def score_data(data, model):
    return score

scores_by_dataset = []
for dataset in Datasets:
    raw_obj = BigData[dataset]        ## WANT TO WRAP THIS IN DELAYED
    processed_obj = raw_obj.process() ## TO DELAY THIS

    X = list(range(processed_obj.n_chunks))
    scores = []
    kf = KFold(n_splits=5)

    for training, validation in kf.split(X):
        train = processed_obj[training]
        test = processed_obj[validation]
        model = delayed(est_model)(delayed(train, traverse=False), params)
        scores.append(delayed(score_data)(delayed(test, traverse=False), model))

    scores_by_dataset.append(scores)

sbs = dask.compute(*scores_by_dataset)

It currently works fine, but I’d like to further parallelize the initial raw-data processing. The problem is that the number of chunks is only known after that pre-processing step, so as far as I can see the inner for loop would have to depend on a delayed object. Is there a pattern for cross-validation that I’m just missing?
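One common way around this is to resolve only the data-dependent quantity (the chunk count) eagerly and keep everything downstream lazy. The sketch below uses the stdlib’s concurrent.futures rather than dask, purely to illustrate the two-stage control flow, and the names (process, est_model, score_data) are toy stand-ins, not the real pipeline:

```python
from concurrent.futures import ThreadPoolExecutor

def process(raw):
    # hypothetical stand-in for raw_obj.process(): returns a list of "chunks"
    return list(raw)

def est_model(train):
    # toy "model": just the sum of the training chunks
    return sum(train)

def score_data(test, model):
    # toy "score"
    return model - sum(test)

def kfold_indices(n, k=5):
    # minimal KFold over chunk indices (contiguous folds, no shuffling)
    fold = n // k
    for i in range(k):
        stop = (i + 1) * fold if i < k - 1 else n
        val = list(range(i * fold, stop))
        train = [j for j in range(n) if j not in val]
        yield train, val

def fit_and_score(train, test):
    # fold-level task: fit then score, so each fold is one unit of work
    return score_data(test, est_model(train))

datasets = {"a": range(10), "b": range(20)}

with ThreadPoolExecutor() as pool:
    # stage 1: kick off preprocessing for every dataset in parallel
    processed = {name: pool.submit(process, raw) for name, raw in datasets.items()}

    # stage 2: block only on the dataset whose folds we are about to build;
    # once its result is in hand, the chunk count is a concrete number and
    # the KFold loop can be written normally
    fold_futures = {}
    for name, fut in processed.items():
        obj = fut.result()
        futs = []
        for tr, va in kfold_indices(len(obj)):
            train = [obj[i] for i in tr]
            test = [obj[i] for i in va]
            futs.append(pool.submit(fit_and_score, train, test))
        fold_futures[name] = futs

    scores_by_dataset = {n: [f.result() for f in fs] for n, fs in fold_futures.items()}
```

The same shape works with dask: compute the processed objects (or just their n_chunks) first, then build the delayed fold graph against the known counts.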

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 16 (10 by maintainers)

Top GitHub Comments

2 reactions
TomAugspurger commented, Mar 26, 2018

https://github.com/dask/dask-ml/compare/master...TomAugspurger:train-test-split?expand=1 has train_test_split mostly finished I think. I don’t recall why I didn’t make a PR with it yet.

Haven’t had time to work on the rest, so if anyone is interested in finishing off that PR or picking up the follow-on things like cross_val_score and the various CV strategies I’d appreciate it.
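For context, the train_test_split discussed above mirrors the scikit-learn API of the same name. A minimal pure-Python sketch of that behaviour over indices (illustrative only, not the dask-ml implementation; no scikit-learn or dask dependency):

```python
import random

def train_test_split(n, test_size=0.25, seed=None):
    # shuffle the indices 0..n-1 and cut off a test partition,
    # mirroring the shape of sklearn.model_selection.train_test_split
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = int(n * test_size)
    return idx[n_test:], idx[:n_test]

train_idx, test_idx = train_test_split(100, test_size=0.2, seed=0)
```

The dask-ml version does the analogous partitioning lazily over blocked collections rather than over plain index lists.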

1 reaction
TomAugspurger commented, Oct 30, 2017

Thanks @kdubovikov

I’m hoping to implement some CV strategies (and utilities like cross_val_score) in dask-ml this week. I’ll update this issue when that’s done.


Top Results From Across the Web

  • Parallel cross-validation: A scalable fitting method for ... — The key idea is to divide the spatial domain into overlapping subsets and to use cross-validation (CV) to estimate the covariance parameters in...
  • Implement Cross-Validation Using Parallel Computing — In this example, use crossval to compute a cross-validation estimate of mean-squared error for a regression model. Run the computations in parallel.
  • Parallel computation of loops for cross-validation analysis — In this post we will show a worked example of how a cross-validation loop can be parallelised. We will use PLS regression...
  • Model Parallelism in Spark ML Cross-Validation (Databricks) — In this video, we will learn how tuning a Spark ML model with cross-validation can be an extremely computationally expensive process.
  • Parallelizing cross-validation | Forecasting Time Series Data ... — There is a lot of iteration going on during cross-validation and these are tasks that can be parallelized to speed things up.
