BUG: merge left and merge inner produce different index-order


According to the df.merge docstring and documentation, concerning the how parameter:

left: use only keys from left frame, similar to a SQL left outer join; preserve key order.
inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.

(Emphasis mine.) I understand this to mean that the order of the left index is preserved when using how='left' or how='inner' (the default when how is omitted). Of course, some keys might be missing in the inner case, but that's beside the point here.

If my understanding is incorrect, my apologies and feel free to discard this report.

If my understanding is correct, however, I noticed today that this is not always the case. Here is an example where the order is not preserved.

Code Sample

import pandas as pd
import numpy as np

data = pd.DataFrame({'value':np.random.randint(-10,100,12)}, index=pd.date_range('2020-01-01', periods=12, freq='M'))
data['q'] = data.index.map(lambda ts: ts.quarter)
data['even'] = data.index.map(lambda ts: ts.month % 2 ==0)
cols = ['even', 'q']
av = data.groupby(cols).apply(lambda gr: gr[['value']].mean())
df1 = data.merge(av, how='inner', left_on=cols, suffixes=['', '_av'], right_index=True)
df2 = data.merge(av, how='left',  left_on=cols, suffixes=['', '_av'], right_index=True)

The dataframes:

data:
            value  q   even
2020-01-31     74  1  False
2020-02-29     87  1   True
2020-03-31     79  1  False
2020-04-30     74  2   True
2020-05-31     71  2  False
2020-06-30     80  2   True
2020-07-31     94  3  False
2020-08-31     19  3   True
2020-09-30     58  3  False
2020-10-31     97  4   True
2020-11-30      5  4  False
2020-12-31     16  4   True

av:
         value
even  q       
False 1   76.5
      2   71.0
      3   76.0
      4    5.0
True  1   87.0
      2   77.0
      3   19.0
      4   56.5

Output

df2: #as expected and wanted
            value  q   even  value_av
2020-01-31     74  1  False      76.5
2020-02-29     87  1   True      87.0
2020-03-31     79  1  False      76.5
2020-04-30     74  2   True      77.0
2020-05-31     71  2  False      71.0
2020-06-30     80  2   True      77.0
2020-07-31     94  3  False      76.0
2020-08-31     19  3   True      19.0
2020-09-30     58  3  False      76.0
2020-10-31     97  4   True      56.5
2020-11-30      5  4  False       5.0
2020-12-31     16  4   True      56.5

df1: #not as expected
            value  q   even  value_av
2020-01-31     74  1  False      76.5
2020-03-31     79  1  False      76.5
2020-02-29     87  1   True      87.0
2020-04-30     74  2   True      77.0
2020-06-30     80  2   True      77.0
2020-05-31     71  2  False      71.0
2020-07-31     94  3  False      76.0
2020-09-30     58  3  False      76.0
2020-08-31     19  3   True      19.0
2020-10-31     97  4   True      56.5
2020-12-31     16  4   True      56.5
2020-11-30      5  4  False       5.0
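
The comparison above can also be done programmatically. The following is a minimal sketch, not part of the original report: it hard-codes the values from the data table instead of drawing random numbers, builds the month-end index without a freq alias, and computes av with groupby(...)[['value']].mean() rather than the apply call, which should yield the same frame. Whether the order check prints True or False may depend on the installed pandas version, so no expected output is given.

```python
import pandas as pd

# Month-end dates for 2020, built without relying on a freq alias.
idx = pd.to_datetime([f"2020-{m:02d}-01" for m in range(1, 13)]) + pd.offsets.MonthEnd(0)

# Values copied from the "data" table in the report (assumption: fixed, not random).
data = pd.DataFrame({"value": [74, 87, 79, 74, 71, 80, 94, 19, 58, 97, 5, 16]}, index=idx)
data["q"] = data.index.quarter
data["even"] = data.index.month % 2 == 0

cols = ["even", "q"]
# Group means, indexed by (even, q); stands in for the apply-based version.
av = data.groupby(cols)[["value"]].mean()

df1 = data.merge(av, how="inner", left_on=cols, right_index=True, suffixes=("", "_av"))
df2 = data.merge(av, how="left", left_on=cols, right_index=True, suffixes=("", "_av"))

# Every key in data has a match in av, so both results hold the same rows.
same_rows = df1.sort_index().equals(df2.sort_index())
# The report is about whether the row order also agrees.
same_order = df1.index.equals(df2.index)

print("same rows: ", same_rows)
print("same order:", same_order)
```

If the order check fails, df1.sort_index() restores the left order in this particular example, because the left index is unique and already sorted.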
  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en
LOCALE : None.None

pandas : 1.0.3
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 42.0.2.post20191203
Cython : 0.29.15
pytest : 5.4.1
hypothesis : 5.8.3
sphinx : 2.4.4
blosc : None
feather : None
xlsxwriter : 1.2.8
lxml.etree : 4.5.0
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.13.0
pandas_datareader: None
bs4 : 4.8.2
bottleneck : 1.3.2
fastparquet : None
gcsfs : None
lxml.etree : 4.5.0
matplotlib : None
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.4.1
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.15
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.8
numba : 0.48.0

Issue Analytics

  • State: open
  • Created 3 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

dsaxton commented, Apr 15, 2020

Here’s a smaller example:

import pandas as pd

left = pd.DataFrame({"a": [1, 2, 1]})
right = pd.DataFrame({"a": [1, 2]})
print(pd.merge(left, right, how="inner", on="a"))
#    a
# 0  1
# 1  1
# 2  2

Does look like the documentation isn’t entirely accurate here.
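
Until the behavior and the documentation agree, one possible workaround is to record each left row's position before merging and sort by it afterwards. This is only a sketch under assumptions: the helper column name _pos is made up for illustration and is not part of the pandas API, and it must not clash with an existing column.

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2, 1]})
right = pd.DataFrame({"a": [1, 2]})

# Tag every left row with its original position ("_pos" is a hypothetical name).
tagged = left.assign(_pos=range(len(left)))

# Inner merge as in the example above; the row order of the result may vary.
merged = tagged.merge(right, how="inner", on="a")

# Sort back into left order, then drop the helper column.
restored = (
    merged.sort_values("_pos", kind="stable")
    .drop(columns="_pos")
    .reset_index(drop=True)
)

print(restored["a"].tolist())  # [1, 2, 1]
```

A stable sort is used so that, when one left row matches several right rows, those matches keep the relative order the merge gave them.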

phofl commented, Oct 25, 2020

@dsaxton I think you could make a case for [1, 1, 2, 1, 1] in your example above, that is, taking every left key in the original order and matching it with values from the other df. Especially since left produces this result (right too, should my PR get pulled). I would not expect [1, 1, 1, 2, 1].
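
For comparison, here is how='left' applied to the two frames from dsaxton's smaller example. It is a short sketch to illustrate the "left keeps the original key order" part of the argument; it does not reproduce the five-element lists mentioned above, whose inputs are not given in the comment.

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2, 1]})
right = pd.DataFrame({"a": [1, 2]})

# A left merge walks the left keys in their original order.
left_merged = pd.merge(left, right, how="left", on="a")

print(left_merged["a"].tolist())  # [1, 2, 1]
```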
