Use Blockwise/`map_partitions` in various DataFrame join methods
I noticed that some join methods contain code like:
    dsk = {
        (name, i): (apply, merge_chunk, [left_key, right_key], kwargs)
        for i, right_key in enumerate(right.__dask_keys__())
    }
where we’re generating a low-level graph that could just be done with `map_partitions`. Using `map_partitions` in these scenarios would both speed up graph transmission and allow for blockwise fusion across the operations. Refactoring these simple sorts of graphs should be straightforward.
- `single_partition_join`
- `hash_join`’s `merge_chunk`
- `stack_partitions` should use `HighLevelGraph.from_collections` instead of merging all of the input graphs
Issue Analytics
- State:
- Created 2 years ago
- Comments:5 (3 by maintainers)
I just realized this is slightly more important than I’d originally thought, because now that low-level optimization is turned off for DataFrames, the only way we get task fusion is through HighLevelGraphs. So even simple linear chains won’t be fused, exposing us to root task overproduction (https://github.com/dask/distributed/issues/5223).
For example, this means that a `single_partition_join` followed by a `map_partitions` operation may have worse memory performance than doing the join yourself within a `map_partitions`, since lots of extra `single_partition_join` outputs can accumulate in memory.
cc @jrbourbeau @ncclementi
Re-opening to continue tracking this here. I’ve also updated the original post to be a checklist instead of a bulleted list (hope that’s okay).