Parallel computing with nested cross-validation

Dear sklearn experts,

Standard use of nested cross-validation within sklearn doesn’t allow multi-core computing. As in the example below, n_jobs has to be set to 1 for both the inner and outer loops:

from sklearn.model_selection import GridSearchCV, cross_val_score
gs = GridSearchCV(pipe_svc, param_grid, scoring=score_type, cv=cv, n_jobs=1)
scores = cross_val_score(gs, X, y, scoring=scoring, cv=cv, n_jobs=1)

Would there be a reasonably simple way to parallelize the jobs in nested cross-validation? That would greatly reduce the computation time.

Thanks in advance!

Best, Matthieu
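
For readers who want to run the snippet from the post above, here is a minimal self-contained sketch of the same sequential nested cross-validation. The dataset, the contents of pipe_svc, the parameter grid, and the two CV splitters are placeholder assumptions, since the post does not show them.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder data; the original post does not say which dataset was used.
X, y = load_breast_cancer(return_X_y=True)

# Assumed pipeline and grid (make_pipeline names the SVC step "svc").
pipe_svc = make_pipeline(StandardScaler(), SVC())
param_grid = {"svc__C": [0.1, 1.0, 10.0], "svc__gamma": ["scale", 0.01]}
score_type = "accuracy"

# Separate splitters for the inner (model selection) and outer (evaluation) loops.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Both loops run sequentially (n_jobs=1), as in the post.
gs = GridSearchCV(pipe_svc, param_grid, scoring=score_type, cv=inner_cv, n_jobs=1)
scores = cross_val_score(gs, X, y, scoring=score_type, cv=outer_cv, n_jobs=1)

print("outer-fold scores:", scores)
print("mean:", scores.mean(), "std:", scores.std())
```

Each outer fold refits the whole grid search on its training part, so the total number of fits grows with the product of outer folds, inner folds, and grid candidates. That product is what makes the sequential version slow.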

Issue Analytics

Created 6 years ago
Comments: 18 (14 by maintainers)

Top GitHub Comments

robna commented, Nov 3, 2022

Although this is closed, I think an update would be good for users who land here now:

What would be the current best practice for parallelism in nested cross-validation with sklearn? As of Nov 2022, the relevant recent versions would be sklearn stable v1.1.3 or 1.2.dev0. @mattvan83 @NicolasHug

Running inside a jupyter notebook, I am trying to use parallel computation on a server (120 cpu cores) like so:

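from joblib import parallel_backend
from sklearn.model_selection import GridSearchCV, cross_validate

# pipe, params, scoring, refit_scorer, model_X and model_y are defined elsewhere
with parallel_backend('loky', n_jobs=-1):
    innerCV = GridSearchCV(
        pipe,
        params,
        scoring=scoring,
        refit=refit_scorer,
        cv=10,
        verbose=1,
    )

    outerCV = cross_validate(
        innerCV,
        model_X,
        model_y,
        scoring=scoring,
        cv=10,
        return_estimator=True,
        verbose=1,
    )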

Here pipe is my estimator object, itself a sklearn.pipeline wrapping some transformers and the various estimator options specified in my params grid.

It runs without errors; however, I am not sure it is fully optimised. At some points during the fit I see load on all CPUs, but most of the time just 10 of them are working. I assume this is due to the cv=10 I am using here (though is it the inner or the outer loop that gets parallelised?).

The times when all CPUs are in use might be when an estimator with some internal (numpy) parallelisation is being tested, I assume?
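
One way to check that assumption is to look at the native thread pools (BLAS, OpenMP) loaded in the process. The sketch below uses threadpoolctl, a separate package that scikit-learn depends on; the matrix size is arbitrary, and capping the pools at one thread is only an experiment to see whether the all-core bursts disappear, not a recommendation.

```python
import numpy as np
from threadpoolctl import threadpool_info, threadpool_limits

# List the native thread pools that are currently loaded and how many
# threads each one may use. NumPy must be imported first so its BLAS shows up.
for pool in threadpool_info():
    print(pool["user_api"], pool["num_threads"])

# Arbitrary matrix, large enough that a multi-threaded BLAS would use several cores.
a = np.random.default_rng(0).standard_normal((500, 500))

# Inside this block, native libraries are limited to a single thread,
# so any remaining multi-core load has to come from process-level parallelism.
with threadpool_limits(limits=1):
    b = a @ a
```

If the bursts vanish when the same limit is wrapped around the fit, they were most likely coming from native multi-threading inside the estimators rather than from the cross-validation workers.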

So, is this a viable way to approach nested CV parallelisation in sklearn today? Or would it be better to:

- specify n_jobs in the inner and outer CV individually instead of using a context manager?
  - and if so, should they both get n_jobs=-1?
- use the dask backend instead of loky (when on a single machine)?
- go about it in a completely different way?

Any guidance welcome! Thanks, robna.
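
The first option in the list above, setting n_jobs on each loop individually, could be sketched as follows. This is one possible arrangement under stated assumptions, not a maintainer recommendation: the dataset, the pipeline contents, the grid, and the N_JOBS value are placeholders. The outer loop stays sequential and only the inner grid search is parallelised, because the inner search has more independent tasks (candidates times inner folds) than the outer loop has folds.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical worker count; -1 would request all available cores.
N_JOBS = 2

# Placeholder data and pipeline; the original comment does not show them.
model_X, model_y = load_breast_cancer(return_X_y=True)
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
params = {"clf__C": [0.01, 0.1, 1.0, 10.0]}

# Inner loop: 4 candidates x 5 folds = 20 independent fits per outer fold,
# distributed over N_JOBS workers.
innerCV = GridSearchCV(pipe, params, scoring="accuracy", cv=5, n_jobs=N_JOBS)

# Outer loop: kept sequential so the two levels do not compete for workers.
outerCV = cross_validate(
    innerCV,
    model_X,
    model_y,
    scoring="accuracy",
    cv=5,
    n_jobs=1,
    return_estimator=True,
)

print("mean outer score:", outerCV["test_score"].mean())
# Best hyperparameters chosen inside each outer fold.
print([est.best_params_ for est in outerCV["estimator"]])
```

Whether this is faster than parallelising the outer loop depends on the grid size, the fold counts, and the cost of a single fit, so it is worth timing both arrangements on the real data.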

jnothman commented, May 2, 2019

I don’t think we have a solution to this yet