
Joblib + np.random.permutation doesn't generate random values if n_jobs > 1


Test case:


import numpy as np
import joblib

def permute(array):
    # Shuffle just the first row, for simplicity
    new_array = np.random.permutation(array[0])
    print(new_array[0:5])  # First few values, to show the problem

data = np.random.randn(4, 1000)

pool = joblib.Parallel(n_jobs=2)

bad = pool(joblib.delayed(permute)(data) for item in range(1, 11))

Output:

[-1.73202122 -1.03073665 -0.74896312  1.33868546  0.5940473 ]
[-1.73202122 -1.03073665 -0.74896312  1.33868546  0.5940473 ]
[-0.5100796  -0.20546035 -0.60574384 -0.92523779  2.05956018]
[-0.5100796  -0.20546035 -0.60574384 -0.92523779  2.05956018]
[ 1.99157153 -0.49067723  1.64050311  0.24392144  0.41525179]
[ 1.99157153 -0.49067723  1.64050311  0.24392144  0.41525179]

Notice how the printed rows repeat in groups of 2, exactly the number of jobs specified in Parallel. The only time this is not observed is when n_jobs = 1 (i.e., no parallelism).

This effect only appears when np.random.permutation is used; random.shuffle() from the stdlib works fine (a sketch contrasting the two follows below). However, I'm not sure whether the bug is in joblib or in numpy. If it's not in joblib, feel free to close this issue.
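
For comparison, here is a minimal sketch of the stdlib counterpart (the helper name permute_stdlib is mine, not from the original report). The usual explanation for the difference is that multiprocessing reseeds Python's own random module in each forked worker, while NumPy's global RandomState is inherited unchanged:

import random
import joblib

def permute_stdlib(seq):
    # random.shuffle shuffles in place, so work on a copy
    out = list(seq)
    random.shuffle(out)
    print(out[0:5])

# The stdlib RNG prints distinct values across workers, unlike the
# NumPy version above.
joblib.Parallel(n_jobs=2)(
    joblib.delayed(permute_stdlib)(range(1000)) for _ in range(10)
)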

Issue Analytics

  • State: closed
  • Created: 11 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
GaelVaroquaux commented, Mar 26, 2013

> I think the problem stems from the fact that multiple workers are starting out from the same RandomState, and so you end up drawing identical sets of indices for your shuffling.

Indeed. This is a common problem when doing random number generation in parallel.

> One workaround is to reset the random seed on each iteration (you don't need to explicitly set it to a different value - you can just set it to None and it will get a new seed value from /dev/urandom or the wall clock).

Ideally, I think that you should avoid using an external RNG, because you lose traceability. I would, in the main program, use the main RNG to choose a seed for the child workers (of course ensuring that all the seeds are different). This is a pattern that we use a lot in the scikit-learn codebase; a sketch of it follows below.
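
A minimal sketch of that pattern, applied to the test case above (the master seed, the seed range, and the distinctness check are illustrative choices, not part of the original comment):

import numpy as np
from joblib import Parallel, delayed

def permute(array, seed):
    # Give each task its own RandomState: results differ across
    # workers but remain reproducible from the master seed.
    rng = np.random.RandomState(seed)
    print(rng.permutation(array[0])[0:5])

data = np.random.randn(4, 1000)

# Draw one seed per task from the main RNG; per the advice above,
# make sure the seeds are all different.
master = np.random.RandomState(0)
seeds = master.randint(2**31 - 1, size=10)
assert len(set(seeds)) == len(seeds)

Parallel(n_jobs=2)(delayed(permute)(data, s) for s in seeds)

With this scheme the printed rows should all differ even with n_jobs = 2, and rerunning with the same master seed reproduces them (print order aside).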

1 reaction
alimuldal commented, Mar 26, 2013

I think the problem stems from the fact that multiple workers are starting out from the same RandomState, and so you end up drawing identical sets of indices for your shuffling. One workaround is to reset the random seed on each iteration (you don’t need to explicitly set it to a different value - you can just set it to None and it will get a new seed value from /dev/urandom or the wall clock).

from joblib import Parallel, delayed
import numpy as np

Same seed

def printrand(n):
    print(np.random.randn(n))

Parallel(n_jobs=-1)(delayed(printrand)(5) for ii in range(10))

Output:

[-0.77270255  1.30150881  1.62902164  0.12962144  0.0572978 ]
[-0.77270255  1.30150881  1.62902164  0.12962144  0.0572978 ]
[-0.77270255  1.30150881  1.62902164  0.12962144  0.0572978 ]
[-0.77270255  1.30150881  1.62902164  0.12962144  0.0572978 ]
[-0.77270255  1.30150881  1.62902164  0.12962144  0.0572978 ]
[ 0.45949328  0.85346312  0.50735963  1.17380902  1.79260382]
[ 0.45949328  0.85346312  0.50735963  1.17380902  1.79260382]
[ 0.45949328  0.85346312  0.50735963  1.17380902  1.79260382]
[ 0.45949328  0.85346312  0.50735963  1.17380902  1.79260382]
[-0.77270255  1.30150881  1.62902164  0.12962144  0.0572978 ]

Different seeds

def printrand_reset(n):
    # Reseeding with None pulls fresh entropy in each worker
    np.random.seed(None)
    print(np.random.randn(n))

Parallel(n_jobs=-1)(delayed(printrand_reset)(5) for ii in range(10))

Output:

[-1.60190562 -1.19917872  0.86253328 -0.3302566  -0.35089248]
[-0.41745894  1.7083357   0.48117448 -0.34452831 -0.86861516]
[-0.62401033  0.05867128 -0.33396944  0.24877314 -0.72299892]
[ 0.70851315 -0.45478262  0.32031148 -0.4905077   0.97100491]
[-0.88931379  1.25180661  1.55436345  0.36201995  2.16707236]
[-0.93832491  0.25186    -1.15841661 -1.65701349 -0.93336423]
[-2.2361376  -0.63555898  1.08580025  1.05623793 -0.74393203]
[ 0.89039908 -0.87591919 -0.28456301  0.37701847 -1.5798119 ]
[ 0.04564946 -1.35125669  0.7739229  -1.10262172  0.76594455]
[ 0.40100168 -0.70120013  1.81587574 -1.89946442  0.14213717]

