
Using multiprocessing scheduler with a repartitioned dataframe is slow

See original GitHub issue
import pandas as pd
import dask.dataframe as dd
import dask.multiprocessing
import distributed

c = distributed.Client()

pdf = pd.DataFrame(dict(A=['A', 'B'] * 1000000, B=['A', 'B'] * 1000000, C=['A', 'B'] * 1000000))
ddf = dd.from_pandas(pdf, npartitions=1000)

# time each combination: multiprocessing scheduler vs. distributed Client,
# with and without collapsing the 1000 partitions down to 10
%time ddf.repartition(npartitions=10).compute(get=dask.multiprocessing.get)
%time ddf.compute(get=dask.multiprocessing.get)
%time ddf.repartition(npartitions=10).compute(get=c.get)
%time ddf.compute(get=c.get)
# %time ddf.repartition(npartitions=10).compute(get=dask.multiprocessing.get)
CPU times: user 30.6 s, sys: 32.7 s, total: 1min 3s
Wall time: 44.4 s
# %time ddf.compute(get=dask.multiprocessing.get)
CPU times: user 648 ms, sys: 3.25 s, total: 3.9 s
Wall time: 3.92 s
# %time ddf.repartition(npartitions=10).compute(get=c.get)
CPU times: user 2.76 s, sys: 604 ms, total: 3.36 s
Wall time: 4.23 s
# %time ddf.compute(get=c.get)
CPU times: user 4.2 s, sys: 692 ms, total: 4.89 s
Wall time: 4.61 s

The multiprocessing scheduler seems to be an order of magnitude slower than the distributed scheduler when the dataframe is repartitioned. Could this be optimized?
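
Note for anyone reproducing this today: the get= keyword used above was later removed from dask in favor of scheduler=, and .compute() defaults to a running distributed.Client once one has been created. A rough modern equivalent of the four timed lines, assuming a recent dask/distributed release and an IPython session for the %time magic (a sketch, not code from the original issue):

import pandas as pd
import dask.dataframe as dd
from distributed import Client

client = Client()  # local cluster; registers itself as the default scheduler

pdf = pd.DataFrame(dict(A=['A', 'B'] * 1000000, B=['A', 'B'] * 1000000, C=['A', 'B'] * 1000000))
ddf = dd.from_pandas(pdf, npartitions=1000)

# multiprocessing ("processes") scheduler, with and without the repartition
%time ddf.repartition(npartitions=10).compute(scheduler='processes')
%time ddf.compute(scheduler='processes')

# distributed scheduler (the active Client), with and without the repartition
%time ddf.repartition(npartitions=10).compute()
%time ddf.compute()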

Issue Analytics

  • State: closed
  • Created: 7 years ago
  • Comments: 9 (5 by maintainers)

Top GitHub Comments

1 reaction
mrocklin commented, Mar 3, 2017

For full generality, choosing that subdataframe is itself a task.

On Fri, Mar 3, 2017 at 5:52 PM, abast notifications@github.com wrote:

Okay, I guess my wrong assumption was that after re-partitioning it is possible to only send the respective “subdataframe” to each worker. Thanks for all your explanations. Closing.

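One way to see what "choosing that subdataframe is itself a task" means is to inspect the graph that repartition builds; a small sketch with a toy dataframe (sizes chosen here purely for illustration):

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({'A': ['A', 'B'] * 1000})
ddf = dd.from_pandas(pdf, npartitions=1000)

# Collapsing 1000 partitions into 10 does not just relabel them: the graph
# holds the 1000 original partition tasks plus the tasks that select and
# concatenate them into each of the 10 outputs.
graph = ddf.repartition(npartitions=10).__dask_graph__()
print(len(graph))  # far more than 10 keys

With the multiprocessing scheduler each of those selection/concatenation tasks has to receive its inputs via inter-process serialization, which matches the maintainer's explanation below about moving the dataframe to each process.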

1 reaction
mrocklin commented, Mar 3, 2017

You move the pandas dataframe to each process for every part of the repartition call (which happens ten times). You should probably use from_pandas(df, npartitions=10). Or better yet, just use the distributed scheduler which will handle this more intelligently.

I guess my question is really “why isn’t the solution to this issue just to use the distributed scheduler?”
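
A minimal sketch of those two suggestions, assuming a current dask/distributed release (scheduler='processes' is the modern spelling of dask.multiprocessing.get); run it interactively or under an if __name__ == '__main__': guard, since worker processes get spawned:

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame(dict(A=['A', 'B'] * 1000000, B=['A', 'B'] * 1000000, C=['A', 'B'] * 1000000))

# Option 1: create the 10 partitions up front instead of building 1000
# and collapsing them afterwards.
ddf = dd.from_pandas(pdf, npartitions=10)
result = ddf.compute(scheduler='processes')

# Option 2: keep the repartition but use the distributed scheduler,
# which (per the comment above) handles it more intelligently.
# from distributed import Client
# client = Client()
# result = dd.from_pandas(pdf, npartitions=1000).repartition(npartitions=10).compute()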

Read more comments on GitHub >

Top Results From Across the Web

Python : Multiprocessing with a huge dataframe is pretty slow
The first step is to reduce the data and keep only the events with signals. We consider a signal when we have a...
Read more >
How to efficiently parallelize Dask Dataframe computation on ...
In this post, we will have a look at the different schedulers that can be employed to parallelize code execution and determine which...
Read more >
slow things first! (intermediate) anthony explains #348
today I talk about multiprocessing and scheduling uneven work and why you generally want slow things to go first!playlist: ...
Read more >
Best practices to scale Apache Spark jobs and partition data ...
Repartitioning a dataset by using the repartition or coalesce functions often results in AWS Glue workers exchanging (shuffling) data, which can ...
Read more >
Comprehensive Guide to Parallel Processing in SAP Data ...
So overall we have a processing operator that is 4x slower than the ... To make use of the multiprocessing library, we must...
Read more >
