Parallel computing with nested cross-validation

Dear sklearn experts,

The standard way of doing nested cross-validation in sklearn doesn't seem to allow multi-core computing. As in the example below, n_jobs has to be set to 1 for both the inner and outer loops:

gs = GridSearchCV(pipe_svc, param_grid, scoring=score_type, cv=cv, n_jobs=1)
scores = cross_val_score(gs, X, y, scoring=scoring, cv=cv, n_jobs=1)

Is there a reasonably simple way to parallelize the jobs in nested cross-validation? That would greatly reduce the computation time.

Thanks in advance!

Best, Matthieu
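
For readers who want to run the setup described in the question, here is a minimal self-contained sketch. The synthetic dataset, the pipeline steps, the parameter grid and the fold counts are all assumptions made for illustration; none of them come from the original post. Both loops keep n_jobs=1, as in the question.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Small synthetic dataset (assumption: stands in for the poster's X, y).
X, y = make_classification(n_samples=120, n_features=8, random_state=0)

# Hypothetical pipeline and grid; make_pipeline names the SVC step "svc".
pipe_svc = make_pipeline(StandardScaler(), SVC())
param_grid = {"svc__C": [0.1, 1.0, 10.0]}

# Separate splitters for the inner (tuning) and outer (evaluation) loops.
inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=4, shuffle=True, random_state=1)

# Inner loop: hyper-parameter search, run serially.
gs = GridSearchCV(pipe_svc, param_grid, scoring="accuracy", cv=inner_cv, n_jobs=1)

# Outer loop: each outer fold refits the whole grid search, also serially.
scores = cross_val_score(gs, X, y, scoring="accuracy", cv=outer_cv, n_jobs=1)
print(scores)
```

This yields one accuracy value per outer fold, so four values with the splitters chosen here.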

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 18 (14 by maintainers)

Top GitHub Comments

1 reaction
robna commented, Nov 3, 2022

Although this is closed, I think an update would be useful for users who land here now:

What is the current best practice for parallelism in nested cross-validation with sklearn? As of Nov 2022, the relevant recent versions would be sklearn stable v1.1.3 or 1.2.dev0. @mattvan83 @NicolasHug

Running inside a Jupyter notebook, I am trying to use parallel computation on a server (120 CPU cores) like so:

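from joblib import parallel_backend
from sklearn.model_selection import GridSearchCV, cross_validate

# pipe, params, scoring, refit_scorer, model_X and model_y are defined elsewhere.
with parallel_backend('loky', n_jobs=-1):
    innerCV = GridSearchCV(
        pipe,
        params,
        scoring=scoring,
        refit=refit_scorer,
        cv=10,
        verbose=1,
    )

    outerCV = cross_validate(
        innerCV,
        model_X,
        model_y,
        scoring=scoring,
        cv=10,
        return_estimator=True,
        verbose=1,
    )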

Here pipe is my estimator object, an sklearn Pipeline wrapping some transformers, with the various estimator options specified in my params grid.

It runs without errors; however, I am not sure whether it is fully optimised. At some points during the fit I see load on all CPUs, but most of the time only 10 of them are working. I assume this is due to the cv=10 I am using here (though is it the inner or the outer loop that gets parallelised?).

The times when all CPUs are in use might be when an estimator is tested which has some internal (numpy) parallelisation, I assume?
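
One way to check the assumption about internal (numpy) parallelisation is to look at the native thread pools loaded in the process. The sketch below uses the threadpoolctl package, which is not mentioned in the post and is an assumption here; the matrix size is arbitrary. It lists the BLAS/OpenMP pools and then caps them at one thread for a block of work, so any remaining multi-core load would have to come from joblib workers.

```python
import numpy as np
from threadpoolctl import threadpool_info, threadpool_limits

# List the native thread pools (e.g. BLAS, OpenMP) loaded in this process
# and how many threads each may use.
for pool in threadpool_info():
    print(pool["user_api"], pool["num_threads"])

a = np.random.default_rng(0).standard_normal((200, 200))

# Temporarily cap every detected native pool at a single thread.
with threadpool_limits(limits=1):
    limited = [pool["num_threads"] for pool in threadpool_info()]
    product = a @ a  # runs with the capped pools

print(limited)
```

The limit applies only inside the with block and only to the current process.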

So, is this a viable way to approach nested CV parallelisation in sklearn today? Or would it be better to:

  • specify n_jobs in inner and outer CV individually instead of using a context manager?
    • and if so, should they both get n_jobs=-1?
  • use the dask backend instead of loky (when on a single machine)?
  • go about it in a completely different way?
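
To make the first option in the list concrete, the sketch below sets n_jobs on each loop individually instead of using a context manager. It is only an illustration under assumptions: the dataset, the pipeline, the grid, the function name nested_cv and the choice of two outer workers with a serial inner search are all hypothetical, and no claim is made that this split is the fastest one.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def nested_cv(X, y, outer_jobs=2, inner_jobs=1):
    """Nested CV with n_jobs set separately for the inner and outer loops."""
    # Hypothetical pipeline; make_pipeline names the step "logisticregression".
    pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))
    params = {"logisticregression__C": [0.1, 1.0, 10.0]}

    # Inner loop: grid search, serial by default in this sketch.
    inner = GridSearchCV(pipe, params, scoring="accuracy", cv=3, n_jobs=inner_jobs)

    # Outer loop: outer folds are dispatched to outer_jobs workers.
    return cross_validate(
        inner, X, y,
        scoring="accuracy",
        cv=3,
        n_jobs=outer_jobs,
        return_estimator=True,
    )


X, y = make_classification(n_samples=150, n_features=6, random_state=0)
result = nested_cv(X, y)
print(result["test_score"])
```

Swapping the two arguments, for example nested_cv(X, y, outer_jobs=1, inner_jobs=2), parallelises the inner search instead, which makes it easy to time both arrangements on a given workload.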

Any guidance welcome! Thanks, robna.

1 reaction
jnothman commented, May 2, 2019

I don’t think we have a solution to this yet.

