Unable to run cross-validation in parallel mode "processes"

Hello, I’m using Prophet v1.0 with Anaconda3 2020.11 on Windows 10 64-bit. I’m trying to run cross-validation in parallel mode “processes” using the example provided in the documentation, but I always get this error message (the error.log is very long so I attached it instead of pasting it here). The code I used:

import pandas as pd
import itertools
import numpy as np
from prophet import Prophet
from prophet.diagnostics import cross_validation
from prophet.diagnostics import performance_metrics

df = pd.read_csv("example_wp_log_peyton_manning.csv")

param_grid = {  
    "changepoint_prior_scale": [0.001, 0.01, 0.1, 0.5],
    "seasonality_prior_scale": [0.01, 0.1, 1.0, 10.0],
}

# Generate all combinations of parameters
all_params = [dict(zip(param_grid.keys(), v)) for v in itertools.product(*param_grid.values())]
rmses = []  # Store the RMSEs for each params here

# Use cross validation to evaluate all parameters
for params in all_params:
    m = Prophet(**params).fit(df)  # Fit model with given params
    df_cv = cross_validation(m, horizon="30 days", parallel="processes")
    df_p = performance_metrics(df_cv, rolling_window=1)
    rmses.append(df_p["rmse"].values[0])

# Find the best parameters
tuning_results = pd.DataFrame(all_params)
tuning_results["rmse"] = rmses
print(tuning_results)

If I run the code on Google Colab then everything is fine.

So can anyone help please? Thank you.

Top GitHub Comments

nviet commented, May 11, 2021

Thanks for your suggestion. Unfortunately, the fork start method is available on Unix only, where it is also the default. On Windows, the only available start method is spawn. This is also stated in the official Python documentation.

Trying to set the start method to fork on Windows results in this error message:

Traceback (most recent call last):
  File "test.py", line 3, in <module>
    multiprocessing.set_start_method("fork")
  File "d:\Programs\anaconda3\envs\myenv\lib\multiprocessing\context.py", line 246, in set_start_method
    self._actual_context = self.get_context(method)
  File "d:\Programs\anaconda3\envs\myenv\lib\multiprocessing\context.py", line 238, in get_context
    return super().get_context(method)
  File "d:\Programs\anaconda3\envs\myenv\lib\multiprocessing\context.py", line 192, in get_context
    raise ValueError('cannot find context for %r' % method) from None
ValueError: cannot find context for 'fork'
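
To avoid this ValueError, the start methods a platform supports can be checked before one is set. The following is a minimal sketch using the standard library only; the helper name pick_start_method is made up for illustration and is not part of Prophet or of multiprocessing.

```python
import multiprocessing


def pick_start_method(preferred="fork"):
    """Return `preferred` if this platform supports it, else the platform default.

    Hypothetical helper for illustration only.
    """
    available = multiprocessing.get_all_start_methods()
    if preferred in available:
        return preferred
    # The first entry of the list is documented as the platform default.
    return available[0]


if __name__ == "__main__":
    # On Windows this list contains only "spawn".
    print(multiprocessing.get_all_start_methods())
    print(pick_start_method("fork"))
```

On Windows the helper falls back to spawn instead of raising, so it does not make fork available; it only avoids the error shown above.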

The issue lies in the line pool = concurrent.futures.ProcessPoolExecutor() in the file diagnostics.py. As explained in a Stack Overflow answer about parallelism on Windows:

Multiprocessing works differently on ms-windows because that OS lacks the fork system call used on UNIX and macOS.

fork creates the child process as a perfect copy of the parent process. All the code and data in both processes are the same. The only difference being the return value of the fork call. (That is to let the new process know it is a copy.) So the child process has access to (a copy of) all the data in the parent process.

On ms-windows, multiprocessing tries to “fake” fork by launching a new python interpreter and have it import your main module. This means (among other things) that your main module has to be importable without side effects such as starting another process. Hence the reason for if __name__ == '__main__'. It also means that your worker processes might or might not have access to data created in the parent process, depending on where it is created. It will have access to anything created before __main__. But it would not have access to anything created inside the main block.
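
The pattern described in the quoted answer can be sketched as follows. This is a generic illustration of the if __name__ == '__main__' guard with ProcessPoolExecutor, not Prophet's code and not a confirmed fix for this issue; the function names square and run_in_processes are made up for the example.

```python
import concurrent.futures


def square(x):
    # Defined at module top level so that spawned workers, which re-import
    # this module, can find the function by name.
    return x * x


def run_in_processes(values):
    # Each worker is a separate Python process; under the spawn start method
    # it starts from a fresh interpreter that imports this module.
    with concurrent.futures.ProcessPoolExecutor(max_workers=2) as pool:
        return list(pool.map(square, values))


if __name__ == "__main__":
    # Without this guard, every spawned worker would execute the line below
    # again while importing the module and try to start its own pool.
    print(run_in_processes([1, 2, 3]))
```

Under fork the guard is not strictly required for this to run, which is one reason code that works on Linux (or Google Colab) can fail on Windows.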

nviet commented, May 1, 2021

According to #1434, running cross-validation with parallel set to threads seems to be much slower than with parallel set to processes. I’m unable to build Prophet on Windows, so my other workaround is to use WSL. In my case there is a huge difference in execution time:

Cross-validation with parallel set to processes in WSL (Debian, without Anaconda): 0:02:44.19 (100% CPU usage)
Cross-validation with parallel set to threads in Windows (with Anaconda): 0:08:40.13 (only around 50% CPU usage)
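
A script that has to run on both systems could pick the parallel mode by platform. The sketch below assumes only what is reported in this thread (processes fails on Windows, threads works there but is slower); the helper choose_parallel_mode is hypothetical, and the Prophet call is left as a comment because it needs the fitted model m from the question.

```python
import sys


def choose_parallel_mode(platform=sys.platform):
    """Pick a value for the `parallel` argument of cross_validation.

    Hypothetical helper: "threads" on Windows, "processes" elsewhere.
    """
    return "threads" if platform.startswith("win") else "processes"


if __name__ == "__main__":
    mode = choose_parallel_mode()
    print(mode)
    # Intended use with the fitted model `m` from the question:
    # df_cv = cross_validation(m, horizon="30 days", parallel=mode)
```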

Hope that in the future there will be someone who can help to debug this issue.
