Using the multiprocessing scheduler with a repartitioned dataframe is slow
See original GitHub issue

import distributed
import dask
import dask.multiprocessing
import dask.dataframe as dd
import pandas as pd

c = distributed.Client()
pdf = pd.DataFrame(dict(A=['A', 'B'] * 1000000,
                        B=['A', 'B'] * 1000000,
                        C=['A', 'B'] * 1000000))
ddf = dd.from_pandas(pdf, npartitions=1000)

%time ddf.repartition(npartitions=10).compute(get=dask.multiprocessing.get)
%time ddf.compute(get=dask.multiprocessing.get)
%time ddf.repartition(npartitions=10).compute(get=c.get)
%time ddf.compute(get=c.get)
# repartition(npartitions=10) + multiprocessing scheduler
CPU times: user 30.6 s, sys: 32.7 s, total: 1min 3s
Wall time: 44.4 s

# multiprocessing scheduler, no repartition
CPU times: user 648 ms, sys: 3.25 s, total: 3.9 s
Wall time: 3.92 s

# repartition(npartitions=10) + distributed scheduler
CPU times: user 2.76 s, sys: 604 ms, total: 3.36 s
Wall time: 4.23 s

# distributed scheduler, no repartition
CPU times: user 4.2 s, sys: 692 ms, total: 4.89 s
Wall time: 4.61 s
The multiprocessing scheduler appears to be an order of magnitude slower than the distributed scheduler when the dataframe is repartitioned. Could this be optimized?
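A plausible way to see where the time goes (a sketch, assuming a recent dask; the exact task count is illustrative): the repartitioned collection's graph still contains a task for each of the 1000 original partitions, and the multiprocessing scheduler serializes every one of those partitions to a worker process, while the distributed scheduler can keep intermediate data on its workers.

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame(dict(A=['A', 'B'] * 1000000,
                        B=['A', 'B'] * 1000000,
                        C=['A', 'B'] * 1000000))
ddf = dd.from_pandas(pdf, npartitions=1000)

# The repartitioned graph still embeds all 1000 source partitions;
# under the multiprocessing scheduler each of them is pickled and
# shipped to a worker process, which dominates the runtime.
graph = ddf.repartition(npartitions=10).__dask_graph__()
print(len(graph))  # on the order of 1000+ tasks, one per source partition plus concats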
Issue Analytics
- Created 7 years ago
- Comments: 9 (5 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments

For full generality, choosing that subdataframe is itself a task.

On Fri, Mar 3, 2017 at 5:52 PM, abast notifications@github.com wrote:

You move the pandas dataframe to each process for every part of the repartition call (which happens ten times). You should probably use from_pandas(df, npartitions=10) instead. Or better yet, just use the distributed scheduler, which will handle this more intelligently.

I guess my question is really "why isn't the solution to this issue just to use the distributed scheduler?"
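Following the advice above, a minimal sketch of the suggested fix (assuming a recent dask, where the old get= keyword has been replaced by scheduler=): build the dataframe with the target partition count up front, so no repartition step is needed under the multiprocessing scheduler.

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame(dict(A=['A', 'B'] * 1000000,
                        B=['A', 'B'] * 1000000,
                        C=['A', 'B'] * 1000000))

# Partition at the target count directly instead of repartitioning,
# so the multiprocessing scheduler never has to ship the extra
# intermediate partitions between processes.
ddf10 = dd.from_pandas(pdf, npartitions=10)

# 'processes' selects the multiprocessing scheduler in current dask;
# on the dask version in the original issue this was spelled
# .compute(get=dask.multiprocessing.get).
result = ddf10.compute(scheduler='processes')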