Parallelizing cross validation

I have a nested loop that I’m trying to parallelize. It looks like this:

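import dask
from dask import delayed
from sklearn.model_selection import KFold

def est_model(data, params):
    # Fit a model on the training chunks (body elided in the original post).
    return model

def score_data(data, model):
    # Score the fitted model on the validation chunks (body elided in the original post).
    return score

scores_by_dataset = []

for dataset in Datasets:
    raw_obj = BigData[dataset]  # WANT TO WRAP THIS IN DELAYED
    processed_obj = raw_obj.process()  # TO DELAY THIS

    # The fold indices depend on n_chunks, which is only known after processing.
    X = list(range(processed_obj.n_chunks))
    scores = []
    kf = KFold(n_splits=5)

    for training, validation in kf.split(X):
        train = processed_obj[training]
        test = processed_obj[validation]
        model = delayed(est_model)(delayed(train, traverse=False), params)
        scores.append(delayed(score_data)(delayed(test, traverse=False), model))

    scores_by_dataset.append(scores)

sbs = dask.compute(*scores_by_dataset)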

It currently works fine, but I’d like to further parallelize the initial raw data processing. The problem is that the number of chunks is only known after this pre-processing step. As far as I can see, this means the for loop over folds would depend on a delayed object. Is there a pattern for cross-validation that I’m just missing?
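One possible way around the dependency is to split the work into two stages: first run all the pre-processing in parallel and compute it, then build the fold tasks from the concrete results. The sketch below is only an illustration under assumptions, not the approach the maintainers settled on. `process`, `select`, and `kfold_indices` are hypothetical stand-ins (for `raw_obj.process()`, chunk indexing, and `KFold(...).split(...)`), and the toy `est_model` / `score_data` bodies are invented so the example runs.

```python
import dask
from dask import delayed


def process(raw):
    # Hypothetical stand-in for raw_obj.process(): split raw data into chunks of 2.
    return [raw[i:i + 2] for i in range(0, len(raw), 2)]


def kfold_indices(n, n_splits):
    # Plain-Python stand-in for KFold(n_splits).split(range(n)): contiguous folds.
    sizes = [n // n_splits + (1 if i < n % n_splits else 0) for i in range(n_splits)]
    start = 0
    for size in sizes:
        validation = list(range(start, start + size))
        training = [i for i in range(n) if i < start or i >= start + size]
        yield training, validation
        start += size


def select(chunks, indices):
    # Hypothetical stand-in for processed_obj[indices].
    return [chunks[i] for i in indices]


def est_model(chunks, params):
    # Toy model (assumption): the mean of all training values plus an offset.
    values = [v for chunk in chunks for v in chunk]
    return sum(values) / len(values) + params["offset"]


def score_data(chunks, model):
    # Toy score (assumption): mean absolute error on the validation values.
    values = [v for chunk in chunks for v in chunk]
    return sum(abs(v - model) for v in values) / len(values)


raw_datasets = {"a": list(range(10)), "b": list(range(20))}
params = {"offset": 0.0}

# Stage 1: pre-process every dataset in parallel; the results are concrete,
# so the number of chunks per dataset is known afterwards.
names = list(raw_datasets)
processed = dask.compute(*[delayed(process)(raw_datasets[name]) for name in names])

# Stage 2: build the fold tasks with an ordinary Python loop, then compute
# all folds of all datasets in a single call.
scores_by_dataset = []
for chunks in processed:
    scores = []
    for training, validation in kfold_indices(len(chunks), n_splits=5):
        train = delayed(select(chunks, training), traverse=False)
        test = delayed(select(chunks, validation), traverse=False)
        model = delayed(est_model)(train, params)
        scores.append(delayed(score_data)(test, model))
    scores_by_dataset.append(scores)

results = dask.compute(*scores_by_dataset)
```

The trade-off is that the processed data is materialized between the two stages, which may be undesirable if it is large; in that case keeping the intermediate results on the workers (for example with `dask.persist`) and fetching only the chunk counts might be worth exploring.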

Issue Analytics

State:
Created 6 years ago
Comments: 16 (10 by maintainers)

Top GitHub Comments

2 reactions

TomAugspurger commented, Mar 26, 2018

https://github.com/dask/dask-ml/compare/master...TomAugspurger:train-test-split?expand=1 has train_test_split mostly finished, I think. I don’t recall why I didn’t make a PR with it yet.

I haven’t had time to work on the rest, so if anyone is interested in finishing off that PR or picking up the follow-on things like cross_val_score and the various CV strategies, I’d appreciate it.

1 reaction

TomAugspurger commented, Oct 30, 2017

Thanks @kdubovikov

I’m hoping to implement some CV strategies (and utilities like cross_val_score) in dask-ml this week. I’ll update this issue when that’s done.
