Infinite loop in dd.merge_asof with empty df on RHS
What happened:
Dask enters an infinite loop in pair_partitions (https://github.com/dask/dask/blob/main/dask/dataframe/multi.py#L736) if dd.merge_asof is passed a specific empty dataframe on the right-hand side.
What you expected to happen:
dd.merge_asof to return the correct result.
Minimal Complete Verifiable Example:
import pandas as pd
import dask.dataframe as dd
import numpy as np
import dask
# some data
left_df = pd.DataFrame({'time': pd.date_range(start='20200101', end='20200108'), 'value': list(range(8))})
# in real usage right df gets created via user filtering/other more complicated actions
right_df = left_df[left_df.time<'2020-01-01'].copy().rename(columns={'value': "another_value"})
left_dd = dd.from_pandas(left_df, npartitions=2)
right_dd = dd.from_pandas(right_df, npartitions=2)
dd.merge_asof(
    left_dd.set_index("time"),
    right_dd.set_index("time"),
    left_index=True,
    right_index=True,
    direction="backward",
    tolerance=pd.Timedelta(days=1),
)
I haven’t had a chance to track down exactly why this happens, but some debugging notes:
left_dd/right_dd must be partitioned. npartitions=1 will execute just fine.
Some experimentation shows that you can trigger this with just
dd.merge_asof(
    left_dd.set_index("time"),
    right_dd.set_index("time"),
    left_index=True,
    right_index=True,
)
It seems I can also trigger this with “simpler” indices:
left_df = pd.DataFrame({'time': list(range(8)), 'value': list(range(8))})
right_df = left_df[left_df.time<0].copy().rename(columns={'value': "another_value"})
Environment:
- Dask version: 2021.06.2 (I’ve also confirmed that this existed in 2021.04.0)
- Python version: 3.7.9
- Operating System: Linux
- Install method (conda, pip, source): pip

Yes! That would be great!
That sounds reasonable to me.
Would you like me to update that error text?
Sure - I’ll put something together. I think I was going to try changing merge_asof_indexed to handle the case of all(map(pd.isnull, right.divisions)) and keeping the pair_partitions logic the same. Does that sound reasonable?
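
The check proposed in that last comment can be sketched in isolation. This is only an illustration, not the actual Dask patch: the helper name all_divisions_null is hypothetical, and it assumes the divisions are available as a plain tuple of boundary values, such as the one exposed by a Dask DataFrame's divisions attribute.

```python
import pandas as pd


def all_divisions_null(divisions):
    """Return True when every division boundary is null (None, NaN, or NaT).

    Hypothetical helper: it only wraps the expression quoted in the thread,
    all(map(pd.isnull, right.divisions)).
    """
    return all(map(pd.isnull, divisions))


# Boundaries with no usable values are reported as all-null.
print(all_divisions_null((None, None)))
print(all_divisions_null((pd.NaT, float("nan"))))

# Boundaries with real values are not.
print(all_divisions_null((0, 4, 7)))
```

A caller could use a check like this to detect a right-hand side with no usable divisions and handle it separately before any partition pairing is attempted. How merge_asof_indexed should respond in that case is left open in the thread.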