BUG: merge left and merge inner produce different index-order
See original GitHub issueAccording to the df.merge
docstring and documentation, concerning the how
parameter:
left: use only keys from left frame, similar to a SQL left outer join; preserve key order. inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.
(emphasis mine.) I understand this to mean, that the order of the left index will be preserved when using how='left'
and how='inner'
(or omitted). Of course, some keys might not be present in the second case, but that’s beside the point here.
If my understanding is incorrect, my apologies and feel free to discard this report.
If my understanding is correct, however, I noticed today that this is not always true. Here is an example, where the order is not preserved.
Code Sample
import pandas as pd
import numpy as np
data = pd.DataFrame({'value':np.random.randint(-10,100,12)}, index=pd.date_range('2020-01-01', periods=12, freq='M'))
data['q'] = data.index.map(lambda ts: ts.quarter)
data['even'] = data.index.map(lambda ts: ts.month % 2 ==0)
cols = ['even', 'q']
av = data.groupby(cols).apply(lambda gr: gr[['value']].mean())
df1 = data.merge(av, how='inner', left_on=cols, suffixes=['', '_av'], right_index=True)
df2 = data.merge(av, how='left', left_on=cols, suffixes=['', '_av'], right_index=True)
The dataframes:
data:
value q even
2020-01-31 74 1 False
2020-02-29 87 1 True
2020-03-31 79 1 False
2020-04-30 74 2 True
2020-05-31 71 2 False
2020-06-30 80 2 True
2020-07-31 94 3 False
2020-08-31 19 3 True
2020-09-30 58 3 False
2020-10-31 97 4 True
2020-11-30 5 4 False
2020-12-31 16 4 True
av:
value
even q
False 1 76.5
2 71.0
3 76.0
4 5.0
True 1 87.0
2 77.0
3 19.0
4 56.5
Output
df2: #as expected and wanted
value q even value_av
2020-01-31 74 1 False 76.5
2020-02-29 87 1 True 87.0
2020-03-31 79 1 False 76.5
2020-04-30 74 2 True 77.0
2020-05-31 71 2 False 71.0
2020-06-30 80 2 True 77.0
2020-07-31 94 3 False 76.0
2020-08-31 19 3 True 19.0
2020-09-30 58 3 False 76.0
2020-10-31 97 4 True 56.5
2020-11-30 5 4 False 5.0
2020-12-31 16 4 True 56.5
df1: #not as expected
value q even value_av
2020-01-31 74 1 False 76.5
2020-03-31 79 1 False 76.5
2020-02-29 87 1 True 87.0
2020-04-30 74 2 True 77.0
2020-06-30 80 2 True 77.0
2020-05-31 71 2 False 71.0
2020-07-31 94 3 False 76.0
2020-09-30 58 3 False 76.0
2020-08-31 19 3 True 19.0
2020-10-31 97 4 True 56.5
2020-12-31 16 4 True 56.5
2020-11-30 5 4 False 5.0
-
I have checked that this issue has not already been reported.
-
I have confirmed this bug exists on the latest version of pandas.
-
(optional) I have confirmed this bug exists on the master branch of pandas.
Output of pd.show_versions()
INSTALLED VERSIONS
commit : None python : 3.7.4.final.0 python-bits : 64 OS : Windows OS-release : 10 machine : AMD64 processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel byteorder : little LC_ALL : None LANG : en LOCALE : None.None
pandas : 1.0.3 numpy : 1.18.1 pytz : 2019.3 dateutil : 2.8.1 pip : 20.0.2 setuptools : 42.0.2.post20191203 Cython : 0.29.15 pytest : 5.4.1 hypothesis : 5.8.3 sphinx : 2.4.4 blosc : None feather : None xlsxwriter : 1.2.8 lxml.etree : 4.5.0 html5lib : 1.0.1 pymysql : None psycopg2 : None jinja2 : 2.11.1 IPython : 7.13.0 pandas_datareader: None bs4 : 4.8.2 bottleneck : 1.3.2 fastparquet : None gcsfs : None lxml.etree : 4.5.0 matplotlib : None numexpr : 2.7.1 odfpy : None openpyxl : 3.0.3 pandas_gbq : None pyarrow : None pytables : None pytest : 5.4.1 pyxlsb : None s3fs : None scipy : 1.4.1 sqlalchemy : 1.3.15 tables : 3.6.1 tabulate : None xarray : None xlrd : 1.2.0 xlwt : 1.3.0 xlsxwriter : 1.2.8 numba : 0.48.0
Issue Analytics
- State:
- Created 3 years ago
- Comments:5 (3 by maintainers)
Top GitHub Comments
Here’s a smaller example:
Does look like the documentation isn’t entirely accurate here.
@dsaxton I think you could make a case for
[1, 1, 2, 1, 1]
in your example above. Taking every left key in the original order and matching it with values from the other df. Expescially since left produces this result (right too, should my pr get pulled). I would not expect[1, 1, 1, 2, 1]