Use Blockwise/`map_partitions` in various DataFrame join methods

I noticed that some join methods have things like

        dsk = {
            (name, i): (apply, merge_chunk, [left_key, right_key], kwargs)
            for i, right_key in enumerate(right.__dask_keys__())
        }

where we’re generating a low-level graph by hand for something that could be done with `map_partitions`. Using `map_partitions` in these scenarios would both speed up graph transmission and allow for blockwise fusion across the operations. Refactoring these simple sorts of graphs should be straightforward.

  • single_partition_join
  • hash_join’s merge_chunk
  • stack_partitions should use HighLevelGraph.from_collections instead of merging all of the input graphs

cc @rjzamora @ncclementi @jrbourbeau

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

gjoseph92 commented, Nov 3, 2021 (2 reactions)

I just realized this is slightly more important than I’d originally thought, because now that low-level optimization is turned off for DataFrames, the only way we get task fusion is through HighLevelGraphs. So even simple linear chains won’t be fused, exposing us to root task overproduction (https://github.com/dask/distributed/issues/5223).

For example, this means that a single_partition_join followed by a map_partitions operation may have worse memory performance than doing the join yourself within a map_partitions, since lots of extra single_partition_join outputs can accumulate in memory.

cc @jrbourbeau @ncclementi

jrbourbeau commented, Nov 15, 2021 (0 reactions)

Re-opening to continue to track here. I’ve also updated the original post to be a checklist instead of a bulleted list (hope that’s okay).
