dataframe merge returns incorrect result when data is from multi-partition parquet
Similar to issue #2237, I am seeing that a dask dataframe merge can return incorrect results when the data comes from a multi-partition parquet file. The pandas merge always produces the correct result.
I have put my test data in https://github.com/ningsean/dask_merge_issue. I am running a local cluster at localhost:9898.
My setup is Python 3.6.1, with the following module versions.
dask.__version__
'0.15.2+25.ga1a75a7'
numpy.__version__
'1.13.1'
partd.__version__
'0.3.8'
pandas.__version__
'0.20.1'
Here is the code that shows the problem.
from dask.distributed import Client
import dask.dataframe as dd
import pandas as pd
client = Client("localhost:9898")
ddf1 = dd.read_parquet("keys_133m.parq")
ddf1.columns
#Index(['KEY'], dtype='object')
ddf1.dtypes
#KEY int64
ddf2 = dd.read_parquet("key_ids_133m.parq")
ddf2.columns
#Index(['KEY', 'ID'], dtype='object')
ddf2.dtypes
#KEY int64
#ID int64
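Both inputs are multi-partition, which can be confirmed as below; the exact counts depend on how the files were written, so I have not shown output values here.
#confirm the inputs are multi-partition (counts depend on the files)
ddf1.npartitions
ddf2.npartitions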
The pandas inner merge result has about 500k rows:
df = ddf1.compute().merge(ddf2.compute(),how='inner',on='KEY')
df.shape
#(505847, 2)
The dask inner merge result has only about 272k rows:
ddf = ddf1.merge(ddf2,how='inner',on='KEY')
len(ddf)
#272051
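One way to enumerate the missing rows, assuming both key columns fit in memory, is to diff the key sets (a diagnostic sketch I added, not output from the original report):
#keys present in the pandas result but absent from the dask result
missing_keys = set(df.KEY) - set(ddf.KEY.compute())
len(missing_keys)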
One example of a missing row in ddf is KEY==135337982:
#dask merge result does not contain the row
ddf.loc[ddf.KEY==135337982].compute()
#Empty DataFrame
#Columns: [KEY, ID]
#Index: []
#pandas merge result contains the row
df.loc[df.KEY==135337982]
# KEY ID
#122842 135337982 149107799
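As a sanity check (not shown in the original report), one can confirm the key is present in both inputs, which would rule out a data problem:
#the key should appear in both input frames
ddf1.loc[ddf1.KEY==135337982].compute()
ddf2.loc[ddf2.KEY==135337982].compute()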
Single-partition parquet works:
#use nlargest to write to a single-partition parquet
ddf1.nlargest(n=505850,columns='KEY').to_parquet('keys_one_partition.parq')
ddf2.nlargest(n=1960255,columns='KEY').to_parquet('key_ids_one_partition.parq')
ddf1_one = dd.read_parquet('keys_one_partition.parq')
ddf2_one = dd.read_parquet('key_ids_one_partition.parq')
ddf_one = ddf1_one.merge(ddf2_one,how='inner',on='KEY')
len(ddf_one)
#505847
df_one = ddf1_one.compute().merge(ddf2_one.compute(),how='inner',on='KEY')
df_one.shape
#(505847, 2)
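An untested alternative to the nlargest trick is to collapse each frame to one partition with repartition; if the bug is specific to multi-partition inputs, this sketch should also produce the correct count:
#untested workaround sketch: force a single partition before merging
ddf1_rep = ddf1.repartition(npartitions=1)
ddf2_rep = ddf2.repartition(npartitions=1)
ddf_rep = ddf1_rep.merge(ddf2_rep,how='inner',on='KEY')
len(ddf_rep)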
Top GitHub Comments
See https://github.com/dask/dask/pull/3201 for a fix.
On Feb 23, 2018, Hussain Sultan wrote: yeah, I will play around some more to reproduce this with synthetic data.
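For anyone trying that, a minimal self-contained synthetic-reproduction sketch might look like the following. The data, sizes, and file names are hypothetical, it is not confirmed that this actually triggers the bug, and it assumes a parquet engine such as fastparquet is installed.
import numpy as np
import pandas as pd
import dask.dataframe as dd
#build two frames that share a random integer key column
rng = np.random.RandomState(0)
keys = rng.randint(0, 200000000, size=1000000)
pdf1 = pd.DataFrame({'KEY': keys})
pdf2 = pd.DataFrame({'KEY': rng.permutation(keys), 'ID': np.arange(len(keys))})
#write both frames as multi-partition parquet, then read them back
dd.from_pandas(pdf1, npartitions=8).to_parquet('keys_synth.parq')
dd.from_pandas(pdf2, npartitions=8).to_parquet('key_ids_synth.parq')
sddf1 = dd.read_parquet('keys_synth.parq')
sddf2 = dd.read_parquet('key_ids_synth.parq')
#the dask and pandas inner merge row counts should agree
n_dask = len(sddf1.merge(sddf2,how='inner',on='KEY'))
n_pandas = len(pdf1.merge(pdf2,how='inner',on='KEY'))
assert n_dask == n_pandas, (n_dask, n_pandas)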