Mac/Linux with multiprocessing: all workers are seeded with the same random state

Reproducing code example:

Full code is here, I will leave this branch untouched so you can see the behavior I’m talking about:

https://github.com/FlorinAndrei/nsphere/tree/numpy-mp

On Mac or Linux, edit xpu_workers.py and comment out the rseed lines, and the bug will be triggered.

You can tell the bug has been triggered because there are very few dots in the Monte Carlo simulation graph in the Jupyter notebook. There should be 100 dots, but because the workers return identical samples the dots land on top of each other and far fewer are visible. The whole population is far less random, which affects the app as a whole.

What’s really going on:

I create a pool of workers with:

from multiprocessing import Pool

# One pool of num_p workers; every worker receives the same argument tuple.
p = Pool(processes=num_p)
arglist = [(points, d, num_p, sysmem, gpumem, pointloops)] * num_p
work_out = p.map(make_dots, arglist)

And within the worker I have something like this:

pts = np.random.random_sample((points, d)) - 0.5

Parts of the pts array are returned as samples from all workers to the master process, and are collated in the work_out matrix. Each worker is supposed to make random samples - and of course the expectation is that each sample is different. https://dilbert.com/strip/2001-10-25

On Windows this works great.

On Mac and Linux, all pts arrays are generated with the exact same “random” content. The samples from workers are all identical. Within each sample the content looks random enough (just an eyeball estimate) but all samples coincide perfectly with each other.

It’s a very frustrating bug, hard to figure out the cause, and makes the code misbehave in weird ways.
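A likely explanation, offered here as an assumption rather than something confirmed in this thread: on Mac and Linux, multiprocessing has traditionally started workers with fork, so each child inherits a copy of the parent's global NumPy random state, while Windows spawns fresh interpreters that seed themselves on import. The sketch below is a minimal reproduction under that assumption. `_draw` and `draws_from_forked_children` are hypothetical names and are not part of the linked repository.

```python
import multiprocessing as mp

import numpy as np


def _draw(queue, n):
    # Draw from NumPy's global (legacy) RNG without reseeding it first.
    queue.put(np.random.random_sample(n).tolist())


def draws_from_forked_children(num_children=4, n=3):
    """Return one sample per child; every child is forked from the same parent."""
    ctx = mp.get_context("fork")  # Unix only; raises ValueError on Windows
    queue = ctx.Queue()
    procs = [ctx.Process(target=_draw, args=(queue, n)) for _ in range(num_children)]
    for proc in procs:
        proc.start()
    # Collect results before joining so no child blocks on a full pipe.
    samples = [queue.get() for _ in procs]
    for proc in procs:
        proc.join()
    return samples


if __name__ == "__main__":
    for sample in draws_from_forked_children():
        print(sample)
```

With the fork start method, every printed list should be identical, which matches the symptom described above.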

I have to do this in each worker to get rid of the bug:

import random
# randint includes both endpoints, and seed() accepts at most 2**32 - 1.
rseed = random.randint(0, 2**32 - 1)
xp.random.seed(rseed)
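One way to keep that reseeding out of the worker body is a pool initializer, which multiprocessing runs once in each worker process at startup. The following is a sketch only: it assumes plain NumPy rather than the `xp` alias, uses the fork start method to match the affected platforms, and `_reseed`, `_sample`, and `sample_with_reseeded_workers` are hypothetical names standing in for the real worker code.

```python
import multiprocessing as mp

import numpy as np


def _reseed():
    # Runs once per worker; seed() with no argument pulls fresh OS entropy.
    np.random.seed()


def _sample(shape):
    # Hypothetical stand-in for the sampling step inside make_dots.
    return (np.random.random_sample(shape) - 0.5).tolist()


def sample_with_reseeded_workers(num_p=4, points=3, d=2):
    ctx = mp.get_context("fork")  # Unix only
    with ctx.Pool(processes=num_p, initializer=_reseed) as pool:
        return pool.map(_sample, [(points, d)] * num_p)


if __name__ == "__main__":
    for sample in sample_with_reseeded_workers():
        print(sample)
```

Because each worker reseeds itself from OS entropy, the results are not reproducible from run to run; passing an explicit seed per worker would be needed for that.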

Numpy/Python version information:

1.16.4 3.7.4 (default, Jul  9 2019, 18:13:23) 
[Clang 10.0.1 (clang-1001.0.46.4)]

Top GitHub Comments

mattip commented, Oct 16, 2019

Without diving too deeply into your code, I wonder if you have seen the new (as of 1.17) random.BitGenerator API? In particular, you might be interested in the work done to ensure parallel processes get “independent” streams. Please let us know if we could improve the documentation to make it clearer, and if it helps solve your problem.
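For readers who want to try that suggestion, here is a minimal sketch assuming NumPy 1.17 or later. It derives one child `SeedSequence` per worker from a single root seed and builds a worker-local `Generator` from each. The function names and the root seed value are hypothetical, and the sampling line only mirrors the `pts` expression from the issue.

```python
import multiprocessing as mp

import numpy as np


def _make_sample(args):
    child_seed, points, d = args
    # Each worker builds its own Generator instead of using the global state.
    rng = np.random.default_rng(child_seed)
    return (rng.random((points, d)) - 0.5).tolist()


def independent_samples(root_seed, num_p=4, points=3, d=2):
    # spawn() derives num_p child SeedSequences from the single root seed.
    child_seeds = np.random.SeedSequence(root_seed).spawn(num_p)
    ctx = mp.get_context("fork")  # Unix only; chosen to match the failing setup
    with ctx.Pool(processes=num_p) as pool:
        return pool.map(_make_sample, [(s, points, d) for s in child_seeds])


if __name__ == "__main__":
    for sample in independent_samples(root_seed=12345):
        print(sample)
```

Since every stream comes from the seed passed in with the task rather than from inherited process state, the approach should not depend on the start method, and rerunning with the same root seed should reproduce the same samples.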

mattip commented, Nov 4, 2019

Closing. Thanks for the update. Hopefully you will try the new API.