BUG: Merge fails on nullable integer columns when dtypes are unequal
See original GitHub issuePandas version checks
-
I have checked that this issue has not already been reported.
-
I have confirmed this bug exists on the latest version of pandas.
-
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import numpy as np
import pandas as pd
null_keys = pd.DataFrame(
{
"key": pd.Series([0, 1, pd.NA], dtype=pd.Int64Dtype()),
"val1": pd.Series(["a", "b", "c"]),
}
)
non_null_keys = pd.DataFrame(
{
"key": pd.Series([0, 1, 2], dtype=np.int64), # This fails!
# "key": pd.Series([0, 1, 2], dtype=pd.UInt64Dtype()), # This also fails!
# "key": pd.Series([0, 1, 2], dtype=pd.Int64Dtype()), # This works!
"val2": pd.Series(["d", "e", "f"]),
}
)
print(pd.merge(null_keys, non_null_keys, how="left", on="key").to_markdown())
Issue Description
If you try to merge two (or more, probably) objects, and all of the following are true
- col1 is a nullable integer type (eg
pd.Int64Dtype()
) - col1 contains nulls
- col2 is an integer type (either nullable eg
pd.Int64Dtype()
or non-nullable egnp.int64
) col1.dtype != col2.dtype
then the merge fails with:
Traceback (most recent call last):
File "test.py", line 20, in <module>
pd.merge(null_keys, non_null_keys, on="key")
File "/workspaces/pandas/pandas/core/reshape/merge.py", line 123, in merge
return op.get_result()
File "/workspaces/pandas/pandas/core/reshape/merge.py", line 717, in get_result
join_index, left_indexer, right_indexer = self._get_join_info()
File "/workspaces/pandas/pandas/core/reshape/merge.py", line 968, in _get_join_info
(left_indexer, right_indexer) = self._get_join_indexers()
File "/workspaces/pandas/pandas/core/reshape/merge.py", line 942, in _get_join_indexers
return get_join_indexers(
File "/workspaces/pandas/pandas/core/reshape/merge.py", line 1497, in get_join_indexers
zipped = zip(*mapped)
File "/workspaces/pandas/pandas/core/reshape/merge.py", line 1494, in <genexpr>
_factorize_keys(left_keys[n], right_keys[n], sort=sort, how=how)
File "/workspaces/pandas/pandas/core/reshape/merge.py", line 2227, in _factorize_keys
lk = ensure_int64(np.asarray(lk))
File "pandas/_libs/algos_common_helper.pxi", line 81, in pandas._libs.algos.ensure_int64
return arr.astype(np.int64, copy=copy)
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NAType'
However, if I change the np.int64
dtype to pd.Int64Dtype()
as well (eg both types are equal), then the merge works fine. (Note that changing both types to non-nullable types isn’t applicable, because some of the keys are indeed null so that would be impossible). Also fails if one is Int64
and the other is UInt64
. Tested on main.
The offending code is probably in pandas/core/reshape/merge.py
, line 2227, in _factorize_keys
, but I didn’t dig any deeper than that.
Expected Behavior
The merge should act the same as when both keys are nullable and dtypes are equal (e.g. both Int64Dtype
):
key | val1 | val2 | |
---|---|---|---|
0 | 0 | a | d |
1 | 1 | b | e |
2 | nan | c | nan |
Installed Versions
INSTALLED VERSIONS
commit : ec7543622a415301849bfa2f64243573c9fe5bee python : 3.8.12.final.0 python-bits : 64 OS : Linux OS-release : 5.10.47-linuxkit Version : #1 SMP Sat Jul 3 21:51:47 UTC 2021 machine : x86_64 processor : x86_64 byteorder : little LC_ALL : en_US.UTF-8 LANG : en_US.UTF-8 LOCALE : None.None
pandas : 1.5.0.dev0+688.gec7543622a numpy : 1.22.2 pytz : 2021.3 dateutil : 2.8.2 pip : 21.3.1 setuptools : 60.9.3 Cython : 0.29.28 pytest : 7.0.1 hypothesis : 6.39.1 sphinx : 4.3.2 blosc : None feather : None xlsxwriter : 3.0.3 lxml.etree : 4.8.0 html5lib : 1.1 pymysql : None psycopg2 : None jinja2 : 3.0.3 IPython : 8.1.1 pandas_datareader: None bs4 : 4.10.0 bottleneck : 1.3.4 brotli : fastparquet : 0.8.0 fsspec : 2021.11.0 gcsfs : 2021.11.0 markupsafe : 2.1.0 matplotlib : 3.5.1 numba : 0.53.1 numexpr : 2.8.0 odfpy : None openpyxl : 3.0.9 pandas_gbq : None pyarrow : 7.0.0 pyreadstat : 1.1.4 pyxlsb : None s3fs : 2021.11.0 scipy : 1.8.0 snappy : sqlalchemy : 1.4.31 tables : 3.7.0 tabulate : 0.8.9 xarray : 0.18.2 xlrd : 2.0.1 xlwt : 1.3.0 zstandard : None
Issue Analytics
- State:
- Created a year ago
- Comments:6 (5 by maintainers)
Top GitHub Comments
@phofl I don’t have permissions to add the duplicate label, IDK if you want to. Thanks!
@NickCrews this looks like a duplicate of #46178?