ENH: When chaining multiple .merge() functions, only the second "suffixes" param produces results
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- (optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
import pandas as pd

df = pd.DataFrame({'year': [2005, 2006], 'cusip': ['111', '222']})
df_comp = pd.DataFrame({'year': [2005, 2006], 'cusip': ['111', '222'], 'test': [5, 10]})
df_comp['year'] = df_comp['year'].astype(int)

# First merge: match on cusip and year; 'test' was expected to become 'test_wyear'
df = df.merge(df_comp, how='left', on=['cusip', 'year'], indicator=True, suffixes=[None, '_wyear'])
print(df['_merge'].value_counts())
df = df.drop(columns='_merge')

# Second merge: match on cusip only; 'test' now exists on both sides
df_comp2 = df_comp.drop(columns='year').drop_duplicates()
df = df.merge(df_comp2, how='left', on=['cusip'], indicator=True, suffixes=[None, '_woyear'])
print(df['_merge'].value_counts())
df = df.drop(columns='_merge')
print(df.columns)
Index(['year', 'cusip', 'test', 'test_woyear'], dtype='object')
Problem description
The suffixes parameter of merge only produces a suffix for the second merge, not the first. This behavior persists if I switch the order of the two merges.
Expected Output
Index(['year', 'cusip', 'test_wyear', 'test_woyear'], dtype='object')
Output of pd.show_versions()
INSTALLED VERSIONS
commit : 2cb96529396d93b46abab7bbc73a208e708c642e
python : 3.8.0.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : de_DE.UTF-8

pandas : 1.2.4
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.2.4
setuptools : 41.2.0
Cython : 0.29.14
pytest : 6.2.3
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.2.7
lxml.etree : 4.5.0
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader : None
bs4 : 4.8.2
bottleneck : None
fsspec : 2021.04.0
fastparquet : None
gcsfs : None
matplotlib : 3.4.1
numexpr : None
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.5.3
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
numba : None
Issue Analytics
- Created 2 years ago
- Comments:7 (3 by maintainers)
Will change the label, though -1 on the enhancement proposal. I'd imagine the more common use case for suffixes is when you have a small number of duplicate columns but many distinct ones, so adding a suffix to the non-duplicate columns would change their meaning unnecessarily. If a schema is well designed, the meaning of a column should not change when merging two data frames (https://en.wikipedia.org/wiki/Entity–relationship_model). What use case do you have where you'd like to always add suffixes? (If you just want a convenient way of adding suffixes, there's already https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.add_suffix.html)
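As a rough sketch of the add_suffix route mentioned in this comment: add_suffix renames every column, including join keys, so one option is to move the key into the index before suffixing. The names lookup, suffixed, and merged are illustrative and not from the report.

```python
import pandas as pd

df = pd.DataFrame({"year": [2005, 2006], "cusip": ["111", "222"]})
lookup = pd.DataFrame({"cusip": ["111", "222"], "test": [5, 10]})

# add_suffix renames every column label, so park the join key in the
# index first; index names are left untouched by add_suffix.
suffixed = lookup.set_index("cusip").add_suffix("_woyear").reset_index()

merged = df.merge(suffixed, how="left", on="cusip")
print(list(merged.columns))
```

This keeps the key column name intact while every other column from the right-hand frame carries the suffix, whether or not it overlaps.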
suffixes only applies when the two frames have overlapping column names. In the first merge, df has no "test" column, so there are no overlapping column names and no suffix is added.
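A minimal sketch of that behavior, using small frames modeled on the report (the variable names left, right, no_overlap, and overlap are illustrative):

```python
import pandas as pd

left = pd.DataFrame({"cusip": ["111", "222"], "year": [2005, 2006]})
right = pd.DataFrame({"cusip": ["111", "222"], "test": [5, 10]})

# No overlap: "test" exists only in `right`, so the suffix is never used.
no_overlap = left.merge(right, how="left", on="cusip", suffixes=(None, "_wyear"))
print(list(no_overlap.columns))

# Overlap: "test" is now on both sides, so the right-hand copy is suffixed.
overlap = no_overlap.merge(right, how="left", on="cusip", suffixes=(None, "_woyear"))
print(list(overlap.columns))
```

The first merge leaves "test" unchanged; only the second merge, where both sides carry a "test" column, produces "test_woyear".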
You can use df.rename if you actually want a different column name after both merges.
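One way this could look for the example in the report, renaming the "test" column on each right-hand frame before merging so that no suffix is needed. This is a sketch, and the intermediate names with_year, without_year, and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"year": [2005, 2006], "cusip": ["111", "222"]})
df_comp = pd.DataFrame({"year": [2005, 2006], "cusip": ["111", "222"], "test": [5, 10]})

# Give each right-hand frame its final column name up front.
with_year = df_comp.rename(columns={"test": "test_wyear"})
without_year = (
    df_comp.drop(columns="year")
    .drop_duplicates()
    .rename(columns={"test": "test_woyear"})
)

result = (
    df.merge(with_year, how="left", on=["cusip", "year"])
    .merge(without_year, how="left", on="cusip")
)
print(list(result.columns))
```

Because the names are fixed before the merges, the result does not depend on the order in which the two merges run.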