Joblib + np.random.permutation doesn't generate random values if n_jobs > 1
Test case:
```python
import numpy as np
import joblib

def permute(array):
    # With one row only for simplicity
    new_array = np.random.permutation(array[0, :])
    print(new_array[0:5])  # To show the problem

data = np.random.randn(4, 1000)
pool = joblib.Parallel(n_jobs=2)
bad = pool(joblib.delayed(permute)(data) for item in range(1, 11))
```
Output:

```
[-1.73202122 -1.03073665 -0.74896312 1.33868546 0.5940473 ]
[-1.73202122 -1.03073665 -0.74896312 1.33868546 0.5940473 ]
[-0.5100796 -0.20546035 -0.60574384 -0.92523779 2.05956018]
[-0.5100796 -0.20546035 -0.60574384 -0.92523779 2.05956018]
[ 1.99157153 -0.49067723 1.64050311 0.24392144 0.41525179]
[ 1.99157153 -0.49067723 1.64050311 0.24392144 0.41525179]
```
Notice how the numbers are equal in groups of 2, exactly the number of jobs specified in `Parallel`. The only time this is not observed is when `n_jobs=1` (i.e., no parallelism).

The effect only appears with `np.random.permutation`; `random.shuffle()` from the stdlib works fine. However, I'm not sure whether the bug is in joblib or in numpy. If it's not in joblib, feel free to close this issue.
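For what it's worth, the effect is not specific to joblib: any fork-based worker pool inherits the parent process's global NumPy `RandomState`. A minimal sketch using `multiprocessing` directly (my own illustration, not part of the original report; the duplicates appear on fork-based platforms such as Linux):

```python
import multiprocessing
import numpy as np

def draw(_):
    # Each forked worker starts from a copy of the parent's global RandomState
    return np.random.permutation(5)

if __name__ == "__main__":
    with multiprocessing.Pool(2) as pool:
        for row in pool.map(draw, range(4)):
            print(row)  # rows typically repeat in pairs, one pair per worker
```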
Issue Analytics
- Created: 11 years ago
- Comments: 5 (3 by maintainers)
Indeed. This is a common problem when doing random number generation in parallel.
Ideally, I think that you should avoid using an external RNG, because you lose traceability. I would, in the main program, use the main RNG to choose a seed for each child process (of course ensuring that all the seeds are different). This is a pattern that we use a lot in the scikit-learn codebase.
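A minimal sketch of that pattern applied to the test case above (the `permute_seeded` helper and the seed-drawing scheme are illustrative, not from the original comment):

```python
import numpy as np
import joblib

def permute_seeded(array, seed):
    # Each worker builds its own RandomState from the seed it was handed
    rng = np.random.RandomState(seed)
    new_array = rng.permutation(array[0, :])
    print(new_array[0:5])

data = np.random.randn(4, 1000)
# The main RNG draws one seed per task, so the whole run stays traceable
# from a single master seed. Collisions are unlikely at this range, but
# uniqueness is cheap to check if it matters.
seeds = np.random.randint(np.iinfo(np.int32).max, size=10)
pool = joblib.Parallel(n_jobs=2)
good = pool(joblib.delayed(permute_seeded)(data, seed) for seed in seeds)
```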
I think the problem stems from the fact that multiple workers are starting out from the same `RandomState`, and so you end up drawing identical sets of indices for your shuffling. One workaround is to reset the random seed on each iteration (you don't need to explicitly set it to a different value - you can just set it to `None` and it will get a new seed value from `/dev/urandom` or the wall clock).

Same seed
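A reconstruction of the two cases, in the spirit of this comment and reusing the test case from the issue. Without any reseeding, every worker draws from the same inherited state:

```python
import numpy as np
import joblib

def permute(array):
    # No reseeding: each worker starts from the same inherited RandomState
    new_array = np.random.permutation(array[0, :])
    print(new_array[0:5])

data = np.random.randn(4, 1000)
joblib.Parallel(n_jobs=2)(joblib.delayed(permute)(data) for _ in range(10))
```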
Output: the permutations repeat in pairs, one pair per worker, just as in the report above.
Different seeds
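With `np.random.seed(None)` at the top of the worker function, each call reseeds from OS entropy (same reconstruction caveat as above):

```python
import numpy as np
import joblib

def permute(array):
    np.random.seed(None)  # re-seed from /dev/urandom or the wall clock
    new_array = np.random.permutation(array[0, :])
    print(new_array[0:5])

data = np.random.randn(4, 1000)
joblib.Parallel(n_jobs=2)(joblib.delayed(permute)(data) for _ in range(10))
```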
Output: every call prints a distinct permutation.