BUG: Union of multi index with EA types can lose EA dtype
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
>>> from cyberpandas import IPArray
>>> import pandas as pd
>>>
>>> df1 = pd.DataFrame({
... 'address': IPArray(['192.168.1.1', '192.168.1.10']),
... 'date': ['2022-01-01', '2022-01-02'],
... 'a': [1, 2]
... })
>>> df1 = df1.set_index(['address', 'date'])
>>>
>>>
>>> df2 = pd.DataFrame({
... 'address': IPArray(['192.168.1.1', '192.168.1.10']),
... 'date': pd.to_datetime(['2022-01-01', '2022-01-02']),
... 'a': [1, 2]
... })
>>> df2 = df2.set_index(['address', 'date'])
>>>
>>> df1.index.dtypes
address ip
date object
dtype: object
>>>
>>> df2.index.dtypes
address ip
date datetime64[ns]
dtype: object
>>>
>>> df1.index.union(df2.index).dtypes
address object # <-- should be type "ip", not "object"
date datetime64[ns]
dtype: object
Issue Description
The extension dtype can get lost when two MultiIndex objects are combined with .union(). This becomes a problem when using df.combine_first(...), which relies on index.union(...).
The problem occurs when both MultiIndexes share the same EA level but the other level (assuming a two-level MultiIndex) has a different dtype in each index. In that case, the EA level of the resulting MultiIndex loses its EA dtype and falls back to object.
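To check whether two indexes meet the precondition described above, a small helper can compare their per-level dtypes. This is only a sketch: `mismatched_levels` is a hypothetical name, and the nullable `Int64` dtype stands in for `IPArray` on the assumption that cyberpandas may not be installed.

```python
import pandas as pd
from pandas.api.types import is_dtype_equal


def mismatched_levels(left: pd.MultiIndex, right: pd.MultiIndex) -> list:
    """Return the names of levels whose dtype differs between the two indexes."""
    left_dtypes, right_dtypes = left.dtypes, right.dtypes
    return [
        name
        for name in left_dtypes.index
        if not is_dtype_equal(left_dtypes[name], right_dtypes[name])
    ]


# Stand-in data (assumption): nullable Int64 replaces the IPArray level of the report.
keys = pd.array([1, 2], dtype="Int64")
left = pd.MultiIndex.from_arrays(
    [keys, ["2022-01-01", "2022-01-02"]], names=["address", "date"]
)
right = pd.MultiIndex.from_arrays(
    [keys, pd.to_datetime(["2022-01-01", "2022-01-02"])], names=["address", "date"]
)

# Only the "date" level differs; the shared EA level has the same dtype on both sides.
differing = mismatched_levels(left, right)
```

When the EA level itself is absent from the returned list but another level is present, the two indexes match the situation in which the report observes the dtype loss.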
Expected Behavior
The EA dtype should be preserved after index.union(...).
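The expectation can be phrased as a check on the `dtypes` Series before and after the union. The sketch below uses a hypothetical helper, `lost_extension_dtypes`, and hand-built dtype Series that mirror the shape of the reported output; they are not computed by an actual union, and `Int64Dtype` is an assumed stand-in for the `ip` dtype.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_dtype_equal, is_extension_array_dtype


def lost_extension_dtypes(before: pd.Series, after: pd.Series) -> list:
    """Names of levels that had an extension dtype before but a different dtype after."""
    return [
        name
        for name, dtype in before.items()
        if is_extension_array_dtype(dtype) and not is_dtype_equal(dtype, after[name])
    ]


# Hand-built stand-ins for `index.dtypes` (assumption: Int64Dtype replaces "ip").
before = pd.Series(
    {"address": pd.Int64Dtype(), "date": np.dtype("O")}, dtype=object
)
after = pd.Series(
    {"address": np.dtype("O"), "date": np.dtype("datetime64[ns]")}, dtype=object
)

# The "address" level was an extension dtype and is now plain object.
lost = lost_extension_dtypes(before, after)
```

With real indexes, the call would be `lost_extension_dtypes(df1.index.dtypes, df1.index.union(df2.index).dtypes)`; an empty list would mean the expected behavior holds.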
Installed Versions
INSTALLED VERSIONS
commit : e8093ba372f9adfe79439d90fe74b0b5b6dea9d6
python : 3.8.13.final.0
python-bits : 64
OS : Darwin
OS-release : 21.5.0
Version : Darwin Kernel Version 21.5.0: Tue Apr 26 21:08:22 PDT 2022; root:xnu-8020.121.3~4/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_AU.UTF-8
LOCALE : en_AU.UTF-8

pandas : 1.4.3
numpy : 1.23.1
pytz : 2022.1
dateutil : 2.8.2
setuptools : 62.3.2
pip : 22.1.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
markupsafe : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
Hi, thanks for your report. As a note: this also happens for our own extension arrays.
We did a couple of PRs in that area. You can check the whatsnew for 2.0; they are documented there.