Using the multiprocessing scheduler with a repartitioned dataframe is slow
See original GitHub issue

import distributed
import dask
import dask.multiprocessing
import dask.dataframe as dd
import pandas as pd

c = distributed.Client()
pdf = pd.DataFrame(dict(A=['A', 'B'] * 1000000,
                        B=['A', 'B'] * 1000000,
                        C=['A', 'B'] * 1000000))
ddf = dd.from_pandas(pdf, npartitions=1000)

%time ddf.repartition(npartitions=10).compute(get=dask.multiprocessing.get)
%time ddf.compute(get=dask.multiprocessing.get)
%time ddf.repartition(npartitions=10).compute(get=c.get)
%time ddf.compute(get=c.get)
# repartition(npartitions=10) + multiprocessing scheduler
CPU times: user 30.6 s, sys: 32.7 s, total: 1min 3s
Wall time: 44.4 s

# multiprocessing scheduler, no repartition
CPU times: user 648 ms, sys: 3.25 s, total: 3.9 s
Wall time: 3.92 s

# repartition(npartitions=10) + distributed scheduler
CPU times: user 2.76 s, sys: 604 ms, total: 3.36 s
Wall time: 4.23 s

# distributed scheduler, no repartition
CPU times: user 4.2 s, sys: 692 ms, total: 4.89 s
Wall time: 4.61 s
The multiprocessing scheduler appears to be an order of magnitude slower than the distributed scheduler when the dataframe is repartitioned. Could this be optimized?
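A plausible way to see where the time goes (a sketch, assuming a recent dask; the exact task count is illustrative): the repartitioned collection's graph still contains a task for each of the 1000 original partitions, and the multiprocessing scheduler serializes every one of those partitions to a worker process, while the distributed scheduler can keep intermediate data on its workers.

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame(dict(A=['A', 'B'] * 1000000,
                        B=['A', 'B'] * 1000000,
                        C=['A', 'B'] * 1000000))
ddf = dd.from_pandas(pdf, npartitions=1000)

# The repartitioned graph still embeds all 1000 source partitions;
# under the multiprocessing scheduler each of them is pickled and
# shipped to a worker process, which dominates the runtime.
graph = ddf.repartition(npartitions=10).__dask_graph__()
print(len(graph))  # on the order of 1000+ tasks, one per source partition plus concats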
Issue Analytics
- Created 7 years ago
- Comments: 9 (5 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments

For full generality, choosing that subdataframe is itself a task.

On Fri, Mar 3, 2017 at 5:52 PM, abast notifications@github.com wrote:

You move the pandas dataframe to each process for every part of the repartition call (which happens ten times). You should probably use from_pandas(df, npartitions=10) instead. Or better yet, just use the distributed scheduler, which will handle this more intelligently.

I guess my question is really "why isn't the solution to this issue just to use the distributed scheduler?"
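Following the advice above, a minimal sketch of the suggested fix (assuming a recent dask, where the old get= keyword has been replaced by scheduler=): build the dataframe with the target partition count up front, so no repartition step is needed under the multiprocessing scheduler.

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame(dict(A=['A', 'B'] * 1000000,
                        B=['A', 'B'] * 1000000,
                        C=['A', 'B'] * 1000000))

# Partition at the target count directly instead of repartitioning,
# so the multiprocessing scheduler never has to ship the extra
# intermediate partitions between processes.
ddf10 = dd.from_pandas(pdf, npartitions=10)

# 'processes' selects the multiprocessing scheduler in current dask;
# on the dask version in the original issue this was spelled
# .compute(get=dask.multiprocessing.get).
result = ddf10.compute(scheduler='processes')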