BUG: column labels converted to string in merge

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

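import pandas as pd

# A plain Python object used as a column label (hashable by default).
class Column:
    def __init__(self, name):
        self.name = name

col = Column(name='col')
df1 = pd.DataFrame({col: [1], 'X': [2]})
df2 = pd.DataFrame({col: [1], 'Y': [3]})

# Both frames share the label `col`, so merge has to disambiguate it.
merged = pd.merge(left=df1, right=df2, left_index=True, right_index=True)

# Fails: the first label is now a str, not a Column instance.
assert not isinstance(merged.columns.tolist()[0], str)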

Issue Description

The column labels of the merged DataFrame are converted to strings (because a suffix was added to the column label that both frames share).

> merged.columns.tolist()
['<__main__.Column object at 0x7f41edd52d50>_x',
 'X',
 '<__main__.Column object at 0x7f41edd52d50>_y',
 'Y']
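The labels above look like the default repr of the object with `_x` / `_y` appended, which is what you would get if the suffix were attached through string formatting. The snippet below only illustrates that effect in plain Python. It is an assumption about the mechanism, not a copy of the pandas internals:

```python
class Column:
    def __init__(self, name):
        self.name = name

col = Column(name="col")

# Formatting a non-string object into a string falls back to its repr,
# so the result is a brand-new str and the original object is lost.
suffixed = f"{col}_x"

print(type(suffixed).__name__)
print(suffixed)
```

Once the label is a string, there is no way to get back to the original `Column` instance from the merged frame.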

Expected Behavior

I would expect merge to keep the column label as a __main__.Column object and not convert it to a string. Regarding the duplication, IMO it's OK to have two identical columns and let users decide how to handle them on their own.
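For comparison, here is a minimal sketch of a result shaped like the expected one, built with `pd.concat(axis=1)` instead of `pd.merge`. This is not a fix for the bug and is not suggested in the issue thread; it only works as a stand-in here because both frames are aligned on their index:

```python
import pandas as pd

class Column:
    def __init__(self, name):
        self.name = name

col = Column(name="col")
df1 = pd.DataFrame({col: [1], "X": [2]})
df2 = pd.DataFrame({col: [1], "Y": [3]})

# Column-wise concat aligns on the index and appends the column labels
# unchanged, so the shared label appears twice and keeps its type.
wide = pd.concat([df1, df2], axis=1)
labels = wide.columns.tolist()

print([type(label).__name__ for label in labels])
```

The trade-off is the one discussed in the comments: the result carries duplicate column labels, and the caller has to decide how to handle them.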

Installed Versions

INSTALLED VERSIONS

commit           : 4bfe3d07b4858144c219b9346329027024102ab6
python           : 3.8.7.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 20.2.0
Version          : Darwin Kernel Version 20.2.0: Wed Dec 2 20:39:59 PST 2020; root:xnu-7195.60.75~1/RELEASE_X86_64
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : None.UTF-8

pandas           : 1.4.2
numpy            : 1.22.3
pytz             : 2022.1
dateutil         : 2.8.2
pip              : 20.2.3
setuptools       : 49.2.1
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : None
pandas_datareader: None
bs4              : None
bottleneck       : None
brotli           : None
fastparquet      : None
fsspec           : None
gcsfs            : None
markupsafe       : None
matplotlib       : None
numba            : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : None
snappy           : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
zstandard        : None

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 6 (5 by maintainers)

Top GitHub Comments

rhshadrach commented, May 24, 2022

@simonjayhawkins

"users not wanting duplicate column labels are accommodated and will be able to use .set_flags(allows_duplicate_labels=False) going forward."

I do not think this is correct. Index labels and column labels are not the same. Duplicate index labels occur often in frames that I work with, yet I never allow duplicate column labels because they are almost impossible to work with. I am not able to use this setting because it forbids both duplicate index and column labels, and I don’t think my usage/experience is niche.
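A small sketch of the behavior described in this comment: `set_flags(allows_duplicate_labels=False)` rejects a frame with duplicate index labels just as it rejects one with duplicate column labels. The helper name `rejects_duplicates` is made up for this illustration:

```python
import pandas as pd
from pandas.errors import DuplicateLabelError

def rejects_duplicates(df):
    """Return True if disallowing duplicate labels raises for this frame."""
    try:
        df.set_flags(allows_duplicate_labels=False)
    except DuplicateLabelError:
        return True
    return False

# Duplicate labels on the index only.
dup_index = pd.DataFrame({"A": [1, 2]}, index=["a", "a"])
# Duplicate labels on the columns only.
dup_columns = pd.DataFrame([[1, 2]], columns=["A", "A"])
# No duplicates on either axis.
unique = pd.DataFrame({"A": [1, 2]}, index=["a", "b"])

print(rejects_duplicates(dup_index))
print(rejects_duplicates(dup_columns))
print(rejects_duplicates(unique))
```

The flag applies to the whole object, so there is no per-axis switch for allowing duplicates on the index while forbidding them on the columns.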

simonjayhawkins commented, May 18, 2022

needs discussion for this scenario, comment was more holistic.

going forward, it should not need to be a decision (or personal preference) on whether a method returns duplicates. duplicate column labels are a documented pandas feature https://pandas.pydata.org/pandas-docs/stable/user_guide/duplicates.html#duplicate-labels and therefore all methods should support them, work correctly with them and correctly propagate them.

and since pandas 1.2, https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.2.0.html#optionally-disallow-duplicate-labels the mechanism for disallowing duplicate column labels now means that allowing/disallowing duplicate column labels should not need to be incorporated into the api design of individual methods.

"I’m -1 on having duplicate columns in the result."

sure. users not wanting duplicate column labels are accommodated and will be able to use .set_flags(allows_duplicate_labels=False) going forward.
