Right/outer merge behaviour on left column and right index is unexpected
See original GitHub issueCode Sample, a copy-pastable example if possible
import pandas as pd
big_index = [123, 124, 125, 126, 127, 128, 129, 130]
big_dat = {'year': pd.Series([2000, 2000, 2000, 2001, 2002, 2002, 2002, 2004], index=big_index),
'other': pd.Series(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'], index=big_index)}
big_df = pd.DataFrame(big_dat)
year_index = [2003, 2000, 2001, 2002]
year_dat = {'a': pd.Series([1, 2, 3, 4], index=year_index),
'b': pd.Series([5, 6, 7, 8], index=year_index)}
year_df = pd.DataFrame(year_dat)
merged_right = pd.merge(
big_df,
year_df,
how='right',
left_on='year',
right_index=True
)
merged_outer = pd.merge(
big_df,
year_df,
how='outer',
left_on='year',
right_index=True
)
merged_right
Out[5]:
other year a b
123 a 2000 2 6
124 b 2000 2 6
125 c 2000 2 6
126 d 2001 3 7
127 e 2002 4 8
128 f 2002 4 8
129 g 2002 4 8
130 NaN 2003 1 5
merged_outer
Out[6]:
other year a b
123 a 2000 2.0 6.0
124 b 2000 2.0 6.0
125 c 2000 2.0 6.0
126 d 2001 3.0 7.0
127 e 2002 4.0 8.0
128 f 2002 4.0 8.0
129 g 2002 4.0 8.0
130 h 2004 NaN NaN
130 NaN 2003 1.0 5.0
Problem description
merged outer and merged right both return year 2003, a 1, b 5 associated with index 130.
Expected Output
Either a NaN index entry or an error/warning. There’s no obvious reason to associate that data with 130 index
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None python: 3.6.1.final.0 python-bits: 64 OS: Windows OS-release: 7 machine: AMD64 processor: Intel64 Family 6 Model 94 Stepping 3, GenuineIntel byteorder: little LC_ALL: None LANG: en LOCALE: None.None
pandas: 0.20.2 pytest: 3.0.7 pip: 9.0.1 setuptools: 27.2.0 Cython: 0.25.2 numpy: 1.12.1 scipy: 0.19.0 xarray: None IPython: 5.3.0 sphinx: 1.5.6 patsy: 0.4.1 dateutil: 2.6.0 pytz: 2017.2 blosc: None bottleneck: 1.2.1 tables: 3.2.2 numexpr: 2.6.2 feather: None matplotlib: 2.0.2 openpyxl: 2.4.7 xlrd: 1.0.0 xlwt: 1.2.0 xlsxwriter: 0.9.6 lxml: 3.7.3 bs4: 4.6.0 html5lib: 0.999 sqlalchemy: 1.1.9 pymysql: None psycopg2: None jinja2: 2.9.6 s3fs: None pandas_gbq: None pandas_datareader: None
Issue Analytics
- State:
- Created 6 years ago
- Comments:13 (11 by maintainers)
Top GitHub Comments
Ah, that’s a little better to read now! I do agree: I too am a little perplexed why there would be such a row, as I don’t know where that would show up in either such join.
Ok thanks.
I have a small follow up question about the initial example. I would expect that the index there is the index of the right
DataFrame
(year_df) because we have a right join. It seems wrong that we get the left index after a right join.