
merge on index vs merge on column - different NaN handling


The examples below show three different merge behaviours when it comes to handling NaNs. They are all based on pd.merge(…, how="left", …).

The difference depends on:

1) Whether we are merging using an index or a column.
2) Whether the keys we are merging on have the same name or not (i.e. whether left_on == right_on).

Arguably, if we specify how="left", the desired behaviour is to have NaNs in the columns coming from the right dataframe wherever the key columns of the left and right dataframes do not match (see the first merge in the example below, columns 'd' and 'e'). The problem is that, when merging on left's index, the NaNs in the key column get filled with the index values from the left dataframe, even if the names of the two keys differ ('c' and 'd' in the example). We are thus led to believe there was a perfect match between the index of the left dataframe and the key column of the right dataframe ('d' here).
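One way to see which rows really matched, instead of inferring it from the filled key column, is the indicator flag of pd.merge. This is only a minimal sketch using the frames from the code sample in this issue; the names merged and matched are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame(columns=['a', 'b', 'c'],
                   data=[[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
df1_c_index = df1.set_index('c')
df2 = pd.DataFrame(columns=['d', 'e'], data=[[3, 14], [6, 15], [13, 16]])

# indicator=True adds a '_merge' column: 'both' for matched rows,
# 'left_only' for left rows with no partner in the right frame.
merged = pd.merge(df1_c_index, df2, how='left',
                  left_index=True, right_on='d', indicator=True)

# Boolean mask of the rows that actually found a match in df2.
matched = merged['_merge'] == 'both'
print(merged)
print(matched.tolist())
```

The '_merge' column reflects the match status regardless of what ends up in the 'd' column, so it can be used to mask the filled values if NaNs are wanted there.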

Gotchas:

- There is something puzzling going on with the new indices of the resulting dataframe (when merging on index).
- Type casting occurs when merging on index, perhaps suggesting NaNs are explicitly filled in a second step.
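Both gotchas can be inspected directly by comparing the dtypes and the index of the two merges. The sketch below prints them rather than stating an expected result, since the exact output may depend on the pandas version; on_column and on_index are illustrative names.

```python
import pandas as pd

df1 = pd.DataFrame(columns=['a', 'b', 'c'],
                   data=[[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
df1_c_index = df1.set_index('c')
df2 = pd.DataFrame(columns=['d', 'e'], data=[[3, 14], [6, 15], [13, 16]])

# Merge on a column: the right key 'd' keeps NaN where nothing matched.
on_column = pd.merge(df1, df2, how='left', left_on='c', right_on='d')

# Merge on left's index: compare the dtype of 'd' and the resulting index.
on_index = pd.merge(df1_c_index, df2, how='left', left_index=True, right_on='d')

print(on_column.dtypes)
print(on_index.dtypes)
print(on_index.index)
```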

Proposed behaviour:

Maybe it is simply a matter of removing this NaN filling step.

Better yet, the key column in the merged dataframe should perhaps bear the name of left's index rather than that of the right_on key (provided we used left's index to merge). That is, in the second merge of the example, the 'd' column should be called 'c'.
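Until something like this is decided, a possible workaround is to turn left's index into a regular column before merging, so that the key keeps left's name and the right key keeps its NaNs. This is a sketch of a workaround, not of any change to pandas itself.

```python
import pandas as pd

df1 = pd.DataFrame(columns=['a', 'b', 'c'],
                   data=[[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
df1_c_index = df1.set_index('c')
df2 = pd.DataFrame(columns=['d', 'e'], data=[[3, 14], [6, 15], [13, 16]])

# Move the index 'c' back into a column, then merge column-on-column.
flat = df1_c_index.reset_index()
merged = flat.merge(df2, how='left', left_on='c', right_on='d')

# Optionally restore 'c' as the index of the result.
merged_indexed = merged.set_index('c')
print(merged_indexed)
```

Here 'c' holds left's key values with their original integer dtype, while 'd' is NaN for the rows of the left frame that have no match in df2.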

This is really the source of the confusion when the two names are different. When they are the same the “no NaN” behaviour is arguably legitimate.

Also it might be worthwhile to cast the final column back to the original dtype if there are no NaNs.
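The cast back can be done by hand after the merge. A minimal sketch, assuming a hypothetical helper restore_dtype (not part of pandas) that only casts when no NaNs are present:

```python
import pandas as pd

def restore_dtype(column, original_dtype):
    """Hypothetical helper: cast back only if the column holds no NaNs."""
    if column.isna().any():
        return column
    return column.astype(original_dtype)

df1 = pd.DataFrame(columns=['a', 'b', 'c'],
                   data=[[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
df2 = pd.DataFrame(columns=['d', 'e'], data=[[3, 14], [6, 15], [13, 16]])

merged = pd.merge(df1, df2, how='left', left_on='c', right_on='d')

# 'e' contains NaNs after the left merge, so it is returned unchanged.
unchanged = restore_dtype(merged['e'], df2['e'].dtype)

# Restricted to matched rows there are no NaNs, so the cast goes through.
matched_rows = merged[merged['d'].notna()]
restored = restore_dtype(matched_rows['e'], df2['e'].dtype)
print(unchanged.dtype, restored.dtype)
```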

Maybe this is not really an issue though, more something to be aware of. I would be interested in hearing any motivation behind this behaviour.

Looking forward to reading your thoughts!

Code Sample:

import pandas as pd

# Left frame with 'c' as a regular column, and a copy with 'c' as the index.
df1 = pd.DataFrame(columns=['a', 'b', 'c'], data=[[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
df1_c_index = df1.set_index('c')
df2 = pd.DataFrame(columns=['d', 'e'], data=[[3, 14], [6, 15], [13, 16]])

print('df1', df1, sep='\n', end='\n\n')
print('df1_c_index', df1_c_index, sep='\n', end='\n\n')
print('df2', df2, sep='\n', end='\n\n')

# First merge: column on column, key names differ ('c' vs 'd').
print("pd.merge(df1, df2, how='left', left_on='c', right_on='d')", end='\n\n')
print(pd.merge(df1, df2, how='left', left_on='c', right_on='d'), end='\n\n')

# Second merge: left index on right column, key names differ.
print("pd.merge(df1_c_index, df2, how='left', left_index=True, right_on='d')", end='\n\n')
print(pd.merge(df1_c_index, df2, how='left', left_index=True, right_on='d'), end='\n\n')

# Rename the right key so that both keys are called 'c'.
df2 = df2.rename(columns={'d': 'c'})

print('df1', df1, sep='\n', end='\n\n')
print('df1_c_index', df1_c_index, sep='\n', end='\n\n')
print('df2', df2, sep='\n', end='\n\n')

# Third merge: column on column, same key name.
print("pd.merge(df1, df2, how='left', left_on='c', right_on='c')", end='\n\n')
print(pd.merge(df1, df2, how='left', left_on='c', right_on='c'), end='\n\n')

# Fourth merge: left index on right column, same key name.
print("pd.merge(df1_c_index, df2, how='left', left_index=True, right_on='c')", end='\n\n')
print(pd.merge(df1_c_index, df2, how='left', left_index=True, right_on='c'), end='\n\n')

Output:

[Two screenshots of the printed output, dated 2016-06-05]

output of pd.show_versions():

INSTALLED VERSIONS

commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 15.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8

pandas: 0.18.1
nose: None
pip: 8.1.2
setuptools: 21.0.0
Cython: None
numpy: 1.11.0
scipy: 0.17.1
statsmodels: None
xarray: 0.7.2
IPython: 4.2.0
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: 1.0.0
tables: None
numexpr: 2.6.0
matplotlib: 1.5.1
openpyxl: None
xlrd: 0.9.4
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.39.0
pandas_datareader: None

Issue Analytics

  • State: open
  • Created 7 years ago
  • Reactions: 1
  • Comments: 10 (5 by maintainers)

Top GitHub Comments

1 reaction
Dr-Irv commented, Nov 30, 2016

I started looking at #14076 to see if I can do something intelligent, and then found the discussion here. @jreback has a comment from Jun 6 that is confusing to me. Let’s assume you are doing a left-join, left_index=True and right_on='some_column_name'. There are 3 cases to consider:

  1. The left index has the same name 'some_column_name' as the column being merged on.
  2. The left index has a different name from 'some_column_name'.
  3. The left index has no name at all.

In the first case, this is like the 4th merge in the original example, and I think the behavior is correct.

In the second case, this is like the 2nd merge above, but I think the name of the resulting column should come from the left DataFrame's index name, because the left DataFrame's index is what is preserved as a column in the result. In the 2nd merge example above, the name comes from the right, which is confusing.

In the third case, since there is no name for the left index, we could raise an error, or maybe the resulting name of the column should be called “left_index” (or something like that), to make it clear that it was that particular index that is used to create the values for the column.
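The three cases can be sketched with a small wrapper around reset_index and merge. The function merge_left_index below is hypothetical (it is not pandas API) and raising an error for an unnamed index is just one of the two options mentioned for the third case.

```python
import pandas as pd

def merge_left_index(left, right, right_on):
    """Hypothetical helper: left-join on left's index, naming the key after it."""
    name = left.index.name
    if name is None:
        # Case 3: no name to give the key column, so refuse to guess.
        raise ValueError("left index has no name; set one first, e.g. with rename_axis")
    # Cases 1 and 2: the key column in the result is named after left's index.
    flat = left.reset_index()
    return flat.merge(right, how='left', left_on=name, right_on=right_on)

df1 = pd.DataFrame(columns=['a', 'b', 'c'],
                   data=[[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
df1_c_index = df1.set_index('c')
df2 = pd.DataFrame(columns=['d', 'e'], data=[[3, 14], [6, 15], [13, 16]])

# Case 2: index name 'c' differs from the right key 'd'.
result = merge_left_index(df1_c_index, df2, right_on='d')
print(result)
```

An unnamed index can be given a name up front with df.rename_axis('some_name') if raising is not desired.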

When the merges involve a MultiIndex, then I think these are the cases:

  1. The left index has the same names as the set of names from the right DataFrame. In this case, everything works as expected.

  2. The left index has a complete set of names. Only some of those names (but not all) match the names on the right DataFrame. In this case, we do as I suggested in (2) above, namely, use the names from the left index.

  3. The left index is missing some (or all) names. In this case, I think we should raise an error, as it’s not clear what a good naming convention should be for missing names in a MultiIndex. If we raise an error here, we probably should raise an error in the third case above for a single index.
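The MultiIndex cases could be handled along the same lines. Again this is only a sketch: merge_left_multiindex is a hypothetical helper, and the frames left and right are made-up data for illustration.

```python
import pandas as pd

def merge_left_multiindex(left, right, right_on):
    """Hypothetical helper: left-join on all levels of left's MultiIndex."""
    names = list(left.index.names)
    if any(name is None for name in names):
        # Case 3: some levels are unnamed, so raise instead of inventing names.
        raise ValueError("left index has unnamed levels: %r" % (names,))
    if len(names) != len(right_on):
        raise ValueError("right_on must list one column per index level")
    # Cases 1 and 2: key columns in the result carry the left index names.
    flat = left.reset_index()
    return flat.merge(right, how='left', left_on=names, right_on=list(right_on))

left = pd.DataFrame(
    {'a': [1, 2]},
    index=pd.MultiIndex.from_tuples([(1, 'x'), (2, 'y')], names=['k1', 'k2']),
)
# Only 'k1' shares its name with a level of the left index (case 2).
right = pd.DataFrame({'k1': [1], 'x2': ['x'], 'v': [10]})

result = merge_left_multiindex(left, right, right_on=['k1', 'x2'])
print(result)
```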

Looking forward to your opinion.

0 reactions
Johndg1974 commented, Dec 7, 2019

can someone help me with python homework?

