Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

merge on index vs merge on column - different NaN handling

See original GitHub issue

In the following examples there are three different merge behaviours when it comes to handling NaNs. They are all based on pd.merge(…, how=“left”, …)

The difference depends on:

1) Whether we are merging using an index or a column.

2) Whether the column keys we are merging on are the same value or not (i.e. if left_on = right_on).

Arguably, if we specify “left” as the merging criterion, the desired behaviour is to have NaNs in the columns coming from the right dataframe where there is no match between the left and right dataframes’ key columns (see first merge in example below, ‘d’ and ‘e’ columns). The problem is, if we are merging on left’s index, the NaNs get filled with the index values from the left dataframe even if the names of the two columns don’t match (‘c’ and ‘d’ in the example). We are thus led to believe there was a perfect match between the index of the left dataframe and the “key” column of the right dataframe (‘d’ here).

Gotchas:

-There is something puzzling going on with the new indices of the resulting dataframe (when merging on index).

-Type casting occurs when merging on index, perhaps suggesting NaNs are explicitly filled in a second step.

Proposed behaviour:

Maybe it is simply a matter of removing this NaN filling step.

Better yet, the “key” column in the merged dataframe should perhaps bear the name of left’s index not of the “right_on” key (provided we used left’s index to merge). I.e. in the second merge of the example, the ‘d’ column should be called ‘c’.

This is really the source of the confusion when the two names are different. When they are the same the “no NaN” behaviour is arguably legitimate.

Also it might be worthwhile to cast the final column back to the original dtype if there are no NaNs.

Maybe this is not really an issue though, more something to be aware of. I would be interested in hearing any motivation behind this behaviour.

Looking forward to reading your thoughts!

Code Sample:

df1 = pd.DataFrame(columns=['a','b','c'], data=[[1,2,3],[4,5,6],[7,8,9],[10,11,12]])
df1_c_index = df1.set_index('c')
df2 = pd.DataFrame(columns=['d','e'], data=[[3,14],[6,15],[13,16]])

print 'df1', '\n', df1, '\n'
print 'df1_c_index', '\n', df1_c_index, '\n'
print 'df2', '\n', df2, '\n'

print "pd.merge(df1, df2, how='left', left_on='c', right_on='d')", '\n'
print pd.merge(df1, df2, how='left', left_on='c', right_on='d'), '\n'

print "pd.merge(df1_c_index, df2, how='left', left_index=True, right_on='d')", '\n'
print pd.merge(df1_c_index, df2, how='left', left_index=True, right_on='d'), '\n'

df2.rename(columns={'d':'c'}, inplace=True)

print 'df1', '\n', df1, '\n'
print 'df1_c_index', '\n', df1_c_index, '\n'
print 'df2', '\n', df2, '\n'

print "pd.merge(df1, df2, how='left', left_on='c', right_on='c')", '\n'
print pd.merge(df1, df2, how='left', left_on='c', right_on='c'), '\n'

print "pd.merge(df1_c_index, df2, how='left', left_index=True, right_on='c')", '\n'
print pd.merge(df1_c_index, df2, how='left', left_index=True, right_on='c'), '\n'

Output:

screen shot 2016-06-05 at 21 06 48

screen shot 2016-06-05 at 21 07 05

output of `pd.show_versions()`:

INSTALLED VERSIONS

commit: None python: 2.7.11.final.0 python-bits: 64 OS: Darwin OS-release: 15.5.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: None LANG: en_GB.UTF-8

pandas: 0.18.1 nose: None pip: 8.1.2 setuptools: 21.0.0 Cython: None numpy: 1.11.0 scipy: 0.17.1 statsmodels: None xarray: 0.7.2 IPython: 4.2.0 sphinx: None patsy: None dateutil: 2.5.3 pytz: 2016.4 blosc: None bottleneck: 1.0.0 tables: None numexpr: 2.6.0 matplotlib: 1.5.1 openpyxl: None xlrd: 0.9.4 xlwt: None xlsxwriter: None lxml: None bs4: None html5lib: None httplib2: None apiclient: None sqlalchemy: None pymysql: None psycopg2: None jinja2: 2.8 boto: 2.39.0 pandas_datareader: None

Issue Analytics

State:
Created 7 years ago
Reactions:1
Comments:10 (5 by maintainers)

Top GitHub Comments

1reaction

Dr-Irvcommented, Nov 30, 2016

I started looking at #14076 to see if I can do something intelligent, and then found the discussion here. @jreback has a comment from Jun 6 that is confusing to me. Let’s assume you are doing a left-join, left_index=True and right_on='some_column_name'. There are 3 cases to consider:

The left index has the same name 'some_column_name' as the column being merged on.
The left index has a different name from 'some_column_name'.
The left index has no name at all.

In the first case, this is like the 4th merge in the original example, and I think the behavior is correct.

In the second case, this is like the 2nd merge above, but I think the name of the resulting column should come from the left DataFrame index name, because the left’s DataFrame index is what is preserved as a column in the result. In the 2nd merge example above, it comes from the right, which is confusing.

In the third case, since there is no name for the left index, we could raise an error, or maybe the resulting name of the column should be called “left_index” (or something like that), to make it clear that it was that particular index that is used to create the values for the column.

When the merges involve a MultiIndex, then I think these are the cases:

The left index has the same names as the set of names from the right DataFrame. In this case, everything works as expected.
The left index has a complete set of name. Only some of those names (but not all) match the names on the right DataFrame. In this case, we do as I suggested in (2) above, namely, use the names from the left index.
The left index is missing some (or all) names. In this case, I think we should raise an error, as it’s not clear what a good naming convention should be for missing names in a MultiIndex. If we raise an error here, we probably should raise an error in the third case above for a single index.

Looking forward to your opinion.

0reactions

Johndg1974commented, Dec 7, 2019

can someone help me with python homework?

Top Results From Across the Web

Merge, join, concatenate and compare - Pandas

The related join() method, uses merge internally for the index-on-index (by default) and column(s)-on-index join. If you are joining on index only, you...

pandas merge dataframe with NaN (or "unknown") for missing ...

I want to combine the two series into a data frame by index. The indices generally match, but there could be a few...

Pandas Merge DataFrames Explained Examples

In this article, I will explain how to merge DataFrames with examples like merging by columns, merging by index, merging on multiple columns,...

Combining Data in Pandas With merge(), .join(), and concat()

pandas merge(): Combining Data on Common Columns or Indices ... unmatched columns in the other object will be filled in with NaN ,...

Pandas merge() - Merging Two DataFrame Objects

on: Column or index level names to join on. These columns must be present in both the DataFrames. If not provided, the intersection...

Troubleshoot Live Code

Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.

Start Free

Top Related Reddit Thread

No results found

Top Related Tweet

No results found

Top Related Dev.to Post

No results found