
scheduler.get_comm_cost a significant portion of runtime in merge benchmarks


I’ve been profiling distributed workflows in an effort to understand where there are potential performance improvements to be made (this is ongoing with @gjoseph92 amongst others). I’m particularly interested in scale-out scenarios, where the number of workers is large. In addition to that scenario, I’ve also been looking at cases where the number of workers is quite small but the dataframes have many partitions: this produces many tasks at a scale where debugging/profiling is a bit more manageable.

The benchmark setup I have builds two dataframes and then merges them on a key column with a specified matching fraction. Each worker gets P partitions with N rows per partition. I use 8 workers. I’m using cudf dataframes (so the merge itself is fast, which means that I notice sequential overheads sooner).
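As a point of reference, a minimal sketch of such a benchmark might look like the following. This is not the original script: it uses pandas/dask.dataframe rather than cudf, and the helper names, the key-range trick used to approximate the matching fraction, and the local-cluster setup are all illustrative assumptions.

import numpy as np
import pandas as pd
import dask
import dask.dataframe as dd
from dask.distributed import Client, wait

N_WORKERS = 8
PARTS_PER_WORKER = 100    # P
ROWS_PER_PART = 500_000   # N

@dask.delayed
def make_partition(rows, match_fraction, seed):
    # Keys are drawn from a range sized so that roughly `match_fraction`
    # of the keys in one frame also appear in the other.
    rng = np.random.default_rng(seed)
    keys = rng.integers(0, int(rows / match_fraction), size=rows)
    return pd.DataFrame({"key": keys, "payload": rng.random(rows)})

def make_frame(match_fraction, seed0):
    nparts = N_WORKERS * PARTS_PER_WORKER
    parts = [make_partition(ROWS_PER_PART, match_fraction, seed0 + i)
             for i in range(nparts)]
    meta = pd.DataFrame({"key": pd.Series(dtype="int64"),
                         "payload": pd.Series(dtype="float64")})
    return dd.from_delayed(parts, meta=meta)

if __name__ == "__main__":
    client = Client(n_workers=N_WORKERS)
    left = make_frame(match_fraction=0.3, seed0=0)
    right = make_frame(match_fraction=0.3, seed0=1_000_000)
    # Merging on a non-index column forces a full shuffle, which is what
    # produces the large task counts the scheduler has to handle.
    merged = left.merge(right, on="key").persist()
    wait(merged)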

Attached are two speedscope plots (and data) from py-spy-based profiling of the scheduler in a scenario with eight workers, P=100, and N=500,000. In a shuffle, the total number of tasks peaks at about 150,000 per the dashboard. The second profile is very noisy since I’m using https://github.com/benfred/py-spy/pull/497 to avoid filtering out python builtins (so that we can see in more detail what is happening). Interestingly, at this scale we don’t see much of a pause in GC (but I am happy to try out more scenarios that might be relevant to #4987).

In this scenario, a single merge takes around 90s; if I do the minimal thing of letting Scheduler.get_comm_cost return 0 immediately, this drops to around 50s (using pandas, it drops from 170s to around 130s). From the detailed profile, we can see that the majority of this time is spent in set.difference. I’m sure there’s a more reasonable fix that isn’t quite such a large hammer.
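For reference, the “large hammer” mentioned above can be reproduced with a monkeypatch along these lines. This is a sketch under assumptions, not a fix: it assumes get_comm_cost is a plain Python method reachable on the Scheduler class in the distributed version being profiled, and it has to run in the scheduler process (trivially true for a LocalCluster) before any work is scheduled. It also deliberately disables the scheduler’s data-transfer cost estimate, so it is only useful for measuring that estimate’s overhead.

from distributed.scheduler import Scheduler

def _zero_comm_cost(self, ts, ws):
    # Pretend that moving any task's dependencies to any worker is free.
    return 0

# Shadow the real method; apply before the scheduler starts scheduling work.
Scheduler.get_comm_cost = _zero_comm_cost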

Attachments:

  • py-spy-scheduler-100-chunks-per-worker (speedscope plot)
  • py-spy-scheduler-100-chunks-per-worker-detailed (speedscope plot)
  • merge-scheduler-100-chunks-per-worker-no-filter.json.gz (profile data)
  • merge-scheduler-100-chunks-per-worker.json.gz (profile data)

(cc @pentschev, @quasiben, and @rjzamora)

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 10 (7 by maintainers)

Top GitHub Comments

1 reaction
wence- commented, Aug 23, 2022

> Just checking in here, did https://github.com/dask/distributed/pull/6931 close this issue, or is there more that folks would like to do?

Yes, it did; I’m about to follow up more coherently on @gjoseph92’s last query in a separate issue.

1 reaction
wence- commented, Aug 18, 2022

> Might actually be better off simply looping here
>
>     for dts in ts.dependencies:
>         if dts not in ws.has_what:
>             nbytes += dts.nbytes
>
> len(ts.dependencies) is typically reasonably small, but ws.has_what can be huge; i.e. rehashing is a bad idea, but getitem / isin should be fast.

I will try this out.
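To illustrate why the suggestion quoted above helps, here is a rough micro-benchmark of the two patterns. It models ws.has_what as a dict keys view over a large mapping, which is an assumption about the WorkerState internals; the sizes are loosely based on the ~150,000-task scale mentioned above. The point is only that set.difference with a large non-set argument ends up walking (and hashing) the whole large collection, whereas the membership loop performs a handful of O(1) lookups.

import timeit

deps = set(range(20))                      # stand-in for ts.dependencies (small)
backing = dict.fromkeys(range(150_000))    # stand-in for the mapping behind ws.has_what (huge)
has_what = backing.keys()                  # a set-like view, not an actual set

def via_difference():
    # With a non-set right-hand side, set.difference copies `deps` and then
    # iterates *all* of `has_what`, discarding each element from the copy.
    return deps.difference(has_what)

def via_loop():
    # The suggested loop only performs len(deps) O(1) membership tests.
    return [d for d in deps if d not in has_what]

print("set.difference :", timeit.timeit(via_difference, number=100))
print("membership loop:", timeit.timeit(via_loop, number=100))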

