Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

BUG: merge left and merge inner produce different index-order

See original GitHub issue

According to the df.merge docstring and documentation, concerning the how parameter:

left: use only keys from left frame, similar to a SQL left outer join; preserve key order. inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.

(emphasis mine.) I understand this to mean, that the order of the left index will be preserved when using how='left' and how='inner' (or omitted). Of course, some keys might not be present in the second case, but that’s beside the point here.

If my understanding is incorrect, my apologies and feel free to discard this report.

If my understanding is correct, however, I noticed today that this is not always true. Here is an example, where the order is not preserved.

Code Sample

import pandas as pd
import numpy as np

data = pd.DataFrame({'value':np.random.randint(-10,100,12)}, index=pd.date_range('2020-01-01', periods=12, freq='M'))
data['q'] = data.index.map(lambda ts: ts.quarter)
data['even'] = data.index.map(lambda ts: ts.month % 2 ==0)
cols = ['even', 'q']
av = data.groupby(cols).apply(lambda gr: gr[['value']].mean())
df1 = data.merge(av, how='inner', left_on=cols, suffixes=['', '_av'], right_index=True)
df2 = data.merge(av, how='left',  left_on=cols, suffixes=['', '_av'], right_index=True)

The dataframes:

data:
            value  q   even
2020-01-31     74  1  False
2020-02-29     87  1   True
2020-03-31     79  1  False
2020-04-30     74  2   True
2020-05-31     71  2  False
2020-06-30     80  2   True
2020-07-31     94  3  False
2020-08-31     19  3   True
2020-09-30     58  3  False
2020-10-31     97  4   True
2020-11-30      5  4  False
2020-12-31     16  4   True

av:
         value
even  q       
False 1   76.5
      2   71.0
      3   76.0
      4    5.0
True  1   87.0
      2   77.0
      3   19.0
      4   56.5

Output

df2: #as expected and wanted
            value  q   even  value_av
2020-01-31     74  1  False      76.5
2020-02-29     87  1   True      87.0
2020-03-31     79  1  False      76.5
2020-04-30     74  2   True      77.0
2020-05-31     71  2  False      71.0
2020-06-30     80  2   True      77.0
2020-07-31     94  3  False      76.0
2020-08-31     19  3   True      19.0
2020-09-30     58  3  False      76.0
2020-10-31     97  4   True      56.5
2020-11-30      5  4  False       5.0
2020-12-31     16  4   True      56.5

df1: #not as expected
            value  q   even  value_av
2020-01-31     74  1  False      76.5
2020-03-31     79  1  False      76.5
2020-02-29     87  1   True      87.0
2020-04-30     74  2   True      77.0
2020-06-30     80  2   True      77.0
2020-05-31     71  2  False      71.0
2020-07-31     94  3  False      76.0
2020-09-30     58  3  False      76.0
2020-08-31     19  3   True      19.0
2020-10-31     97  4   True      56.5
2020-12-31     16  4   True      56.5
2020-11-30      5  4  False       5.0

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : None python : 3.7.4.final.0 python-bits : 64 OS : Windows OS-release : 10 machine : AMD64 processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel byteorder : little LC_ALL : None LANG : en LOCALE : None.None

pandas : 1.0.3 numpy : 1.18.1 pytz : 2019.3 dateutil : 2.8.1 pip : 20.0.2 setuptools : 42.0.2.post20191203 Cython : 0.29.15 pytest : 5.4.1 hypothesis : 5.8.3 sphinx : 2.4.4 blosc : None feather : None xlsxwriter : 1.2.8 lxml.etree : 4.5.0 html5lib : 1.0.1 pymysql : None psycopg2 : None jinja2 : 2.11.1 IPython : 7.13.0 pandas_datareader: None bs4 : 4.8.2 bottleneck : 1.3.2 fastparquet : None gcsfs : None lxml.etree : 4.5.0 matplotlib : None numexpr : 2.7.1 odfpy : None openpyxl : 3.0.3 pandas_gbq : None pyarrow : None pytables : None pytest : 5.4.1 pyxlsb : None s3fs : None scipy : 1.4.1 sqlalchemy : 1.3.15 tables : 3.6.1 tabulate : None xarray : None xlrd : 1.2.0 xlwt : 1.3.0 xlsxwriter : 1.2.8 numba : 0.48.0

Issue Analytics

State:
Created 3 years ago
Comments:5 (3 by maintainers)

Top GitHub Comments

2reactions

dsaxtoncommented, Apr 15, 2020

Here’s a smaller example:

import pandas as pd

left = pd.DataFrame({"a": [1, 2, 1]})
right = pd.DataFrame({"a": [1, 2]})
print(pd.merge(left, right, how="inner", on="a"))
#    a
# 0  1
# 1  1
# 2  2

Does look like the documentation isn’t entirely accurate here.

0reactions

phoflcommented, Oct 25, 2020

@dsaxton I think you could make a case for [1, 1, 2, 1, 1] in your example above. Taking every left key in the original order and matching it with values from the other df. Expescially since left produces this result (right too, should my pr get pulled). I would not expect [1, 1, 1, 2, 1]