Infinite loop in dd.merge_asof with empty df on RHS

What happened:

Dask enters an infinite loop in pair_partitions (https://github.com/dask/dask/blob/main/dask/dataframe/multi.py#L736) when dd.merge_asof is passed a specific empty dataframe on the right-hand side.

What you expected to happen:

dd.merge_asof to return the correct result.

Minimal Complete Verifiable Example:

import pandas as pd
import dask.dataframe as dd
import numpy as np
import dask

# some data
left_df = pd.DataFrame({'time': pd.date_range(start='20200101', end='20200108'), 'value': list(range(8))})

# in real usage right df gets created via user filtering/other more complicated actions
right_df = left_df[left_df.time<'2020-01-01'].copy().rename(columns={'value': "another_value"})

left_dd = dd.from_pandas(left_df, npartitions=2)
right_dd = dd.from_pandas(right_df, npartitions=2)

dd.merge_asof(
    left_dd.set_index("time"),
    right_dd.set_index("time"),
    left_index=True, 
    right_index=True,
    direction="backward",
    tolerance=pd.Timedelta(days=1),
)

I haven’t had a chance to track down exactly why this happens, but some debugging notes:

left_dd/right_dd must be split into multiple partitions; with npartitions=1 the call executes just fine.

Some experimentation shows that you can trigger this with just

dd.merge_asof(
    left_dd.set_index("time"),
    right_dd.set_index("time"),
    left_index=True, 
    right_index=True,
)

I can also trigger this with a “simpler” integer index:

left_df = pd.DataFrame({'time':  list(range(8)), 'value': list(range(8))})
right_df = left_df[left_df.time<0].copy().rename(columns={'value': "another_value"})
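
Until this is fixed, one possible workaround is to skip the as-of merge entirely when the right-hand frame is empty. The sketch below is only an illustration under two assumptions: the right-hand side is still available as a pandas DataFrame before dd.from_pandas is called, and filling the missing columns with NaN is an acceptable stand-in for the merge result. The helper name asof_or_fill is hypothetical and not part of pandas or Dask.

```python
import numpy as np
import pandas as pd


def asof_or_fill(left, right, **kwargs):
    """Hypothetical guard: skip the as-of merge when the right frame is empty."""
    if len(right) == 0:
        # Nothing to match against, so every right-hand column is missing.
        filler = {col: np.nan for col in right.columns if col not in left.columns}
        return left.assign(**filler)
    return pd.merge_asof(left, right, **kwargs)


# Same "simple index" data as above, already indexed by time.
left_df = pd.DataFrame({"time": list(range(8)), "value": list(range(8))}).set_index("time")
right_df = left_df[left_df.index < 0].rename(columns={"value": "another_value"})

result = asof_or_fill(left_df, right_df, left_index=True, right_index=True)
```

The same emptiness check could be applied before building the Dask dataframes, so that dd.merge_asof is only called when the right-hand side has rows.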

Environment:

  • Dask version: 2021.06.2 (I’ve also confirmed that this existed in 2021.04.0)
  • Python version: 3.7.9
  • Operating System: Linux
  • Install method (conda, pip, source): pip

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

jsignell commented, Jun 28, 2021 (1 reaction)

> Would you like me to update that error text?

Yes! That would be great!

> I think I was going to try changing merge_asof_indexed to handle the case of all(map(pd.isnull, right.divisions)) and keeping pair_partitions logic the same. Does that sound reasonable?

That sounds reasonable to me.

gerrymanoim commented, Jun 28, 2021 (0 reactions)

> So I’m not sure if it’ll be possible to make https://github.com/dask/dask/blob/main/dask/dataframe/multi.py#L879 really check if the dataframes are empty.

Would you like me to update that error text?

> I think that you are right that fixing pair_partitions is a better bet. Are you able to open a pull request?

Sure - I’ll put something together. I think I was going to try changing merge_asof_indexed to handle the case of all(map(pd.isnull, right.divisions)) and keeping pair_partitions logic the same. Does that sound reasonable?
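
For reference, the check mentioned in the comments can be sketched on its own. This is only an illustration of the expression all(map(pd.isnull, right.divisions)) applied to plain tuples, not the actual change to merge_asof_indexed; the function name divisions_all_null is hypothetical, and the idea that an empty right-hand frame ends up with all-null divisions is taken from the discussion rather than verified here.

```python
import pandas as pd


def divisions_all_null(divisions):
    """Return True when every division boundary is missing (None, NaN or NaT)."""
    return all(map(pd.isnull, divisions))


# Tuples standing in for a Dask dataframe's .divisions attribute.
empty_like = (None, None, None)
mixed_nulls = (float("nan"), pd.NaT)
populated = (0, 4, 7)
```

A guard like this would let the merge return early for an all-null right-hand side while leaving the pair_partitions logic unchanged, which is the approach proposed above.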
