Running RandomizedSearchCV results in inability to perform further tasks in multiprocessing

Describe the bug

After RandomizedSearchCV has been run, any function subsequently executed in multiprocessing mode (concurrent.futures.ProcessPoolExecutor) freezes.

If a function is run in multiprocessing mode before RandomizedSearchCV, everything works fine.

I have tried reducing n_jobs in RandomizedSearchCV to 1, but subsequent multiprocessing calls still freeze. I have also tried changing the joblib backend from the default joblib.parallel_backend('loky') to

joblib.parallel_backend('multiprocessing')
joblib.parallel_backend('threading')

…but that didn’t help.

Interesting note: the problem does not reproduce if the subsequent function is very simple (for example, it just prints a string). But as soon as any complexity is added to the function, it fails to run after RandomizedSearchCV. It looks as if RandomizedSearchCV does not release its workers, or triggers some other processes.

Here is a small video to illustrate the problem:

https://vimeo.com/user50681456/review/474733642/b712c12c2c

Steps/Code to Reproduce

from xgboost import XGBRegressor
from sklearn.model_selection import KFold
import concurrent.futures
from sklearn.datasets import make_regression
import pandas as pd
import numpy as np
from sklearn.model_selection import RandomizedSearchCV

# DEFINE FUNCTIONS
def simple_func():
    from sklearn.datasets import make_regression
    # JUST CREATING A DATASET, NOT EVEN FITTING ANY MODEL!!! AND IT FREEZES
    data = make_regression(n_samples=500, n_features=100, n_informative=10, n_targets=1, random_state=5)
    print('Fit complete')

def just_print():
    print('Just printing')

def run_randomized_search_cv():
    data = make_regression(n_samples=500, n_features=100, n_informative=10, n_targets=1, random_state=5)
    X = pd.DataFrame(data[0])
    y = pd.Series(data[1])
    kf = KFold(n_splits = 3, shuffle = True, random_state = 5)
    model = XGBRegressor()
    params = {
            'min_child_weight':     [0.1, 1, 5],
            'subsample':            [0.5, 0.7, 1.0],
            'colsample_bytree':     [0.5, 0.7, 1.0],
            'eta':                  [0.005, 0.01, 0.1]
            }
    random_search = RandomizedSearchCV(
            model,
            param_distributions =   params,
            n_iter =                25,
            n_jobs =                -1,
            refit =                 True, # necessary for random_search.best_estimator_
            cv =                    kf.split(X,y),
            verbose =               1,
            random_state =          5
            )
    random_search.fit(X, np.array(y))


# STEP 0
# test multiprocessing with concurrent.futures and a simple function
with concurrent.futures.ProcessPoolExecutor() as executor:
    results_temp = [executor.submit(simple_func) for i in range(0,12)]
# ----------------------------------------------------------------------------

# STEP 1
# simulate RandomizedSearchCV
run_randomized_search_cv()
# ----------------------------------------------------------------------------

# STEP 2.0
# test multiprocessing with a function that just prints
with concurrent.futures.ProcessPoolExecutor() as executor:
    results_temp = [executor.submit(just_print) for i in range(0,12)]
# ----------------------------------------------------------------------------


# STEP 3
# test the function from STEP 0
with concurrent.futures.ProcessPoolExecutor() as executor:
    results_temp = [executor.submit(simple_func) for i in range(0,12)]
# ----------------------------------------------------------------------------

Expected Results

The last call to a function in multiprocessing mode prints ‘Fit complete’ 12 times.

Actual Results

The last call to a function in multiprocessing mode freezes.

Versions

System:
    python: 3.7.6 (default, Jan 8 2020, 13:42:34) [Clang 4.0.1 (tags/RELEASE_401/final)]
    executable: /Users/danil/anaconda3/bin/python
    machine: Darwin-19.6.0-x86_64-i386-64bit

Python dependencies:
    pip: 20.2.3
    setuptools: 46.0.0.post20200309
    sklearn: 0.22.1
    numpy: 1.18.1
    scipy: 1.4.1
    Cython: 0.29.15
    pandas: 1.0.5
    matplotlib: 3.3.2
    joblib: 0.17.0

Built with OpenMP: True

Darwin-19.6.0-x86_64-i386-64bit
Python 3.7.6 (default, Jan 8 2020, 13:42:34) [Clang 4.0.1 (tags/RELEASE_401/final)]
NumPy 1.18.1
SciPy 1.4.1
Scikit-Learn 0.22.1
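
Since the fix reported further down in this thread was a change of multiprocessing start method, the start method is worth recording alongside the versions above. A minimal sketch using only the standard library; `describe_start_method` is a hypothetical helper name, not part of scikit-learn or joblib:

```python
import multiprocessing


def describe_start_method():
    """Report the multiprocessing start method without fixing it as a side effect."""
    # allow_none=True avoids locking in the default, so a later
    # multiprocessing.set_start_method(...) call still works.
    current = multiprocessing.get_start_method(allow_none=True)
    available = multiprocessing.get_all_start_methods()
    return {
        "current": current,       # None means no method has been fixed yet
        "default": available[0],  # the first supported method is the platform default
        "available": available,
    }


if __name__ == "__main__":
    print(describe_start_method())
```

On the macOS / Python 3.7 setup reported here the default is expected to be `fork`; Python 3.8 changed the macOS default to `spawn`.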

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 5 (1 by maintainers)

Top GitHub Comments

cmarmo commented, Nov 23, 2020

Thanks @DanilZherebtsov for reaching out and thanks @rworreby for investigating the issue, this was really helpful. If I understand correctly, this is not a bug in scikit-learn, so I’m going to close this issue. Feel free to reopen it if there is still something to solve.

DanilZherebtsov commented, Nov 11, 2020

Hi rworreby, thanks for looking into this issue!

Could you please explain what your comment “Built with OpenMP: True” means, and how I can check the current status of this setting?

P.S. I have managed to solve the problem by inserting the following at the beginning of my program:

import multiprocessing
multiprocessing.set_start_method('forkserver')

as explained here: https://scikit-learn.org/stable/faq.html#why-do-i-sometime-get-a-crash-freeze-with-n-jobs-1-under-osx-or-linux

But the shell loses some interactivity, as intermediate results no longer get printed while the program is running.
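
A narrower variant of the workaround above, sketched under assumptions: rather than changing the global start method, a forkserver context can be passed to the executor through its `mp_context` parameter (available since Python 3.7). `pick_context`, `square`, and `run_squares` are hypothetical names for illustration, and whether this avoids the freeze in the original program has not been verified here.

```python
import concurrent.futures
import multiprocessing


def pick_context(preferred="forkserver"):
    """Return a multiprocessing context, falling back to 'spawn' if unavailable."""
    # forkserver does not exist on every platform (for example Windows),
    # while spawn is supported everywhere.
    if preferred in multiprocessing.get_all_start_methods():
        return multiprocessing.get_context(preferred)
    return multiprocessing.get_context("spawn")


def square(x):
    """Stand-in for real work; must live at module level so it can be pickled."""
    return x * x


def run_squares(n):
    """Run square() over range(n) in worker processes started via the chosen context."""
    ctx = pick_context()
    # Only this executor uses the context; the global start method is untouched.
    with concurrent.futures.ProcessPoolExecutor(mp_context=ctx) as executor:
        return list(executor.map(square, range(n)))
```

With forkserver or spawn, `run_squares` should be called from under an `if __name__ == "__main__":` guard in a script, because worker processes re-import the main module.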

P.P.S. My setuptools version is ‘46.0.0.post20200309’.
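
On the question above about “Built with OpenMP: True”: that line appears to be part of the report printed by `sklearn.show_versions()`, the same kind of report pasted in the Versions section. A small sketch that captures the report as a string; `sklearn_build_report` is a hypothetical helper name, and it returns `None` when scikit-learn is not installed:

```python
import contextlib
import io


def sklearn_build_report():
    """Return the text printed by sklearn.show_versions(), or None without sklearn."""
    try:
        import sklearn
    except ImportError:
        return None
    buffer = io.StringIO()
    # show_versions() prints to stdout, so redirect it into the buffer.
    with contextlib.redirect_stdout(buffer):
        sklearn.show_versions()
    return buffer.getvalue()


if __name__ == "__main__":
    report = sklearn_build_report()
    print(report if report is not None else "scikit-learn is not installed")
```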
