
dataframe merge returns incorrect result when data is from multi-partition parquet

Similar to issue #2237, I am seeing that a dask dataframe merge can return incorrect results when the data comes from multi-partition parquet; the equivalent pandas merge always produces the correct result.

I have put my test data in https://github.com/ningsean/dask_merge_issue. I am running a local cluster at localhost:9898.

My setup is Python 3.6.1 with the following module versions:

dask.__version__
'0.15.2+25.ga1a75a7'

numpy.__version__
'1.13.1'

partd.__version__
'0.3.8'

pandas.__version__
'0.20.1'

Here is the code that shows the problem.

from dask.distributed import Client
import dask.dataframe as dd
import pandas as pd

client = Client("localhost:9898") 

ddf1 = dd.read_parquet("keys_133m.parq")
ddf1.columns
#Index(['KEY'], dtype='object')
ddf1.dtypes
#KEY    int64

ddf2 = dd.read_parquet("key_ids_133m.parq")
ddf2.columns
#Index(['KEY', 'ID'], dtype='object')
ddf2.dtypes
#KEY    int64
#ID     int64

The pandas inner merge result has about 500k rows:

df = ddf1.compute().merge(ddf2.compute(),how='inner',on='KEY')
df.shape
#(505847, 2)

The dask inner merge result has only about 272k rows:

ddf = ddf1.merge(ddf2,how='inner',on='KEY')
len(ddf)
#272051

One example of a row missing from ddf is KEY==135337982:

#dask merge result does not contain the row
ddf.loc[ddf.KEY==135337982].compute()
#Empty DataFrame
#Columns: [KEY, ID]
#Index: []

#pandas merge result contains the row
df.loc[df.KEY==135337982]
#              KEY         ID
#122842  135337982  149107799
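
A quick way to show that this row is dropped by the dask merge itself, rather than absent from the data, is to look the key up in both raw inputs. A sketch reusing the frames above; the expected output is inferred from the pandas result, not re-run here:

#the key is present in both inputs, so the dask merge itself drops it
ddf1.loc[ddf1.KEY == 135337982].compute()  #expected: non-empty
ddf2.loc[ddf2.KEY == 135337982].compute()  #expected: non-empty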

Single-partition parquet works:

#use nlargest to write to a single-partition parquet
ddf1.nlargest(n=505850,columns='KEY').to_parquet('keys_one_partition.parq')
ddf2.nlargest(n=1960255,columns='KEY').to_parquet('key_ids_one_partition.parq')

ddf1_one = dd.read_parquet('keys_one_partition.parq')
ddf2_one = dd.read_parquet('key_ids_one_partition.parq')

ddf_one = ddf1_one.merge(ddf2_one,how='inner',on='KEY')
len(ddf_one)
#505847

df_one = ddf1_one.compute().merge(ddf2_one.compute(),how='inner',on='KEY')
df_one.shape
#(505847, 2)
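
Since the single-partition case merges correctly, one possible workaround (a sketch only, assuming both datasets fit comfortably in memory on a single worker; not verified against this data) is to collapse the inputs to one partition before merging:

#workaround sketch: one partition per side avoids the
#multi-partition shuffle path (assumes data fits in memory)
ddf_fixed = ddf1.repartition(npartitions=1).merge(
    ddf2.repartition(npartitions=1), how='inner', on='KEY')
len(ddf_fixed)  #expected to match the pandas count, 505847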

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 21 (14 by maintainers)

Top GitHub Comments

mrocklin commented, Feb 23, 2018 (1 reaction)

See https://github.com/dask/dask/pull/3201 for a fix

On Fri, Feb 23, 2018 at 9:47 AM, Hussain Sultan wrote:

cc: @jcrist

hussainsultan commented, Feb 21, 2018 (1 reaction)

Yeah, I will play around some more to reproduce this with synthetic data.
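
A synthetic reproduction along these lines might look like the sketch below. Everything in it is made up for illustration (data, sizes, partition counts, and file names), it assumes a parquet engine such as fastparquet is installed, and whether it actually triggers the bug depends on how the keys land across partitions:

import numpy as np
import pandas as pd
import dask.dataframe as dd

#build two frames whose KEY columns partially overlap
rng = np.random.RandomState(0)
left = pd.DataFrame({'KEY': rng.randint(0, 1000000, size=500000)})
right = pd.DataFrame({'KEY': rng.randint(0, 1000000, size=500000),
                      'ID': np.arange(500000)})

#write each frame as multi-partition parquet, then read it back
dd.from_pandas(left, npartitions=8).to_parquet('left.parq')
dd.from_pandas(right, npartitions=8).to_parquet('right.parq')
ddf_l = dd.read_parquet('left.parq')
ddf_r = dd.read_parquet('right.parq')

#compare the dask merge against the pandas reference
n_dask = len(ddf_l.merge(ddf_r, how='inner', on='KEY'))
n_pandas = len(left.merge(right, how='inner', on='KEY'))
print(n_dask, n_pandas)  #a mismatch reproduces the bug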
