Parallel computing with nested cross-validation
Dear sklearn experts,
Standard use of nested cross-validation within sklearn doesn't allow multi-core computing. As in the example below, n_jobs has to be set to 1 for the inner/outer loops:
gs = GridSearchCV(pipe_svc, param_grid, scoring=score_type, cv=inner_cv, n_jobs=1)
scores = cross_val_score(gs, X, y, scoring=score_type, cv=outer_cv, n_jobs=1)
Would there be any not-too-difficult way to parallelize jobs in nested cross-validation, which would greatly reduce computation time?
Thanks in advance!
Best, Matthieu
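A common way to get multi-core nested CV with the standard sklearn API is to parallelise only one of the two loops, typically the outer one, so the levels do not compete for workers. A minimal runnable sketch, where pipe_svc, param_grid, and score_type are stand-ins mirroring the names in the snippet above (the toy data and grid are illustrative, not from the original issue):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Small stand-in dataset and pipeline for illustration.
X, y = make_classification(n_samples=200, random_state=0)
pipe_svc = make_pipeline(StandardScaler(), SVC())
param_grid = {"svc__C": [0.1, 1.0, 10.0]}
score_type = "accuracy"

# Keep the inner search serial (n_jobs=1) and give the cores to the
# outer loop (n_jobs=-1), avoiding nested-parallelism oversubscription.
gs = GridSearchCV(pipe_svc, param_grid, scoring=score_type, cv=3, n_jobs=1)
scores = cross_val_score(gs, X, y, scoring=score_type, cv=5, n_jobs=-1)
```

The reverse split (parallel inner, serial outer) also works; which is faster depends on the number of outer folds versus the size of the parameter grid.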
Issue Analytics
- State: Closed
- Created: 6 years ago
- Comments: 18 (14 by maintainers)
Top Results From Across the Web
- How to speed up nested cross validation in python?
  Any guidance on how I could speed this up would be appreciated. Edit: I have also tried using parallel processing with dask, but...
- MPI-based Nested Cross-Validation for scikit-learn
  If you are working with machine learning, at some point you have to choose hyper-parameters for your model of choice and do cross-validation...
- Nested Cross-Validation for Machine Learning with Python
  Nested cross-validation is an approach to model hyperparameter optimization and model selection that attempts to overcome the problem of ...
- nestedcv
  Nested cross-validation (CV) provides a way to get round this, by maximising use of the whole dataset for testing overall accuracy, while...
- Nested cross-validation - Statistics - Julia Discourse
  According to quite some literature, cross-validation (CV) is biased and should be replaced by nested cross-validation whenever you can, ...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Although this is closed, I think it would be good for an update, for users who land here now:

What would be the current best practice for parallelism in nested cross-validation with sklearn today? (Nov 2022: sklearn stable v1.1.3, or 1.2.dev0, would be the relevant recent versions.) @mattvan83 @NicolasHug

Running inside a Jupyter notebook, I am trying to use parallel computation on a server (120 CPU cores) with a context manager around the fit. The pipe is my estimator object, which is itself a sklearn.pipeline wrapping some transformers and various options for estimators specified in my params grid. It runs without errors; however, I am not sure if it is completely optimised. Some of the time during the fit I see load on all CPUs, but most of the time just 10 of them get to work. I assume this is due to the cv=10 I am using here (though, is it the inner or the outer loop that gets parallelised?). The times when all CPUs are in use might be when an estimator is tested which has some internal (numpy) parallelisation, I assume?

So, is this a tangible way to approach nested CV parallelisation in sklearn today? Or would it be better to set n_jobs in the inner and outer CV individually instead of using a context manager? Or to use n_jobs=-1? Any guidance welcome! Thanks, robna.
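The code snippet this comment refers to did not survive the page extraction. A hedged reconstruction of the kind of setup it describes — a Pipeline inside GridSearchCV, fitted under a joblib parallel_backend context manager — with stand-in data, estimator, and grid (none of these are the commenter's actual objects):

```python
from joblib import parallel_backend
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data and pipeline for illustration.
X, y = make_classification(n_samples=200, random_state=0)
pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=1000))])
params = {"clf__C": [0.01, 0.1, 1.0]}

# n_jobs=-1 lets the search use whatever the active joblib backend provides.
grid = GridSearchCV(pipe, params, cv=10, n_jobs=-1)

# The context manager sets the backend and worker count for all
# joblib-parallel calls inside it; with cv=10 and 3 candidates there are
# at most 30 fits to spread over the workers (4 here stands in for a
# many-core server).
with parallel_backend("loky", n_jobs=4):
    grid.fit(X, y)
```

With this pattern, only the GridSearchCV level is joblib-parallel (10 folds x candidates), which is consistent with the observation above that roughly cv=10 workers stay busy at a time.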
I don’t think we have a solution to this yet.