Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

BUG: Merge fails on nullable integer columns when dtypes are unequal

See original GitHub issue

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import numpy as np
import pandas as pd

null_keys = pd.DataFrame(
    {
        "key": pd.Series([0, 1, pd.NA], dtype=pd.Int64Dtype()),
        "val1": pd.Series(["a", "b", "c"]),
    }
)
non_null_keys = pd.DataFrame(
    {
        "key": pd.Series([0, 1, 2], dtype=np.int64),  # This fails!
        # "key": pd.Series([0, 1, 2], dtype=pd.UInt64Dtype()),  # This also fails!
        # "key": pd.Series([0, 1, 2], dtype=pd.Int64Dtype()),  # This works!
        "val2": pd.Series(["d", "e", "f"]),
    }
)
print(pd.merge(null_keys, non_null_keys, how="left", on="key").to_markdown())

Issue Description

If you try to merge two (or more, probably) objects, and all of the following are true

col1 is a nullable integer type (eg pd.Int64Dtype())
col1 contains nulls
col2 is an integer type (either nullable eg pd.Int64Dtype() or non-nullable eg np.int64)
col1.dtype != col2.dtype then the merge fails with:

Traceback (most recent call last):
  File "test.py", line 20, in <module>
    pd.merge(null_keys, non_null_keys, on="key")
  File "/workspaces/pandas/pandas/core/reshape/merge.py", line 123, in merge
    return op.get_result()
  File "/workspaces/pandas/pandas/core/reshape/merge.py", line 717, in get_result
    join_index, left_indexer, right_indexer = self._get_join_info()
  File "/workspaces/pandas/pandas/core/reshape/merge.py", line 968, in _get_join_info
    (left_indexer, right_indexer) = self._get_join_indexers()
  File "/workspaces/pandas/pandas/core/reshape/merge.py", line 942, in _get_join_indexers
    return get_join_indexers(
  File "/workspaces/pandas/pandas/core/reshape/merge.py", line 1497, in get_join_indexers
    zipped = zip(*mapped)
  File "/workspaces/pandas/pandas/core/reshape/merge.py", line 1494, in <genexpr>
    _factorize_keys(left_keys[n], right_keys[n], sort=sort, how=how)
  File "/workspaces/pandas/pandas/core/reshape/merge.py", line 2227, in _factorize_keys
    lk = ensure_int64(np.asarray(lk))
  File "pandas/_libs/algos_common_helper.pxi", line 81, in pandas._libs.algos.ensure_int64
    return arr.astype(np.int64, copy=copy)
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NAType'

However, if I change the np.int64 dtype to pd.Int64Dtype() as well (eg both types are equal), then the merge works fine. (Note that changing both types to non-nullable types isn’t applicable, because some of the keys are indeed null so that would be impossible). Also fails if one is Int64 and the other is UInt64. Tested on main.

The offending code is probably in pandas/core/reshape/merge.py, line 2227, in _factorize_keys, but I didn’t dig any deeper than that.

Expected Behavior

The merge should act the same as when both keys are nullable and dtypes are equal (e.g. both Int64Dtype):

	key	val1	val2
0	0	a	d
1	1	b	e
2	nan	c	nan

Installed Versions

INSTALLED VERSIONS

commit : ec7543622a415301849bfa2f64243573c9fe5bee python : 3.8.12.final.0 python-bits : 64 OS : Linux OS-release : 5.10.47-linuxkit Version : #1 SMP Sat Jul 3 21:51:47 UTC 2021 machine : x86_64 processor : x86_64 byteorder : little LC_ALL : en_US.UTF-8 LANG : en_US.UTF-8 LOCALE : None.None

pandas : 1.5.0.dev0+688.gec7543622a numpy : 1.22.2 pytz : 2021.3 dateutil : 2.8.2 pip : 21.3.1 setuptools : 60.9.3 Cython : 0.29.28 pytest : 7.0.1 hypothesis : 6.39.1 sphinx : 4.3.2 blosc : None feather : None xlsxwriter : 3.0.3 lxml.etree : 4.8.0 html5lib : 1.1 pymysql : None psycopg2 : None jinja2 : 3.0.3 IPython : 8.1.1 pandas_datareader: None bs4 : 4.10.0 bottleneck : 1.3.4 brotli : fastparquet : 0.8.0 fsspec : 2021.11.0 gcsfs : 2021.11.0 markupsafe : 2.1.0 matplotlib : 3.5.1 numba : 0.53.1 numexpr : 2.8.0 odfpy : None openpyxl : 3.0.9 pandas_gbq : None pyarrow : 7.0.0 pyreadstat : 1.1.4 pyxlsb : None s3fs : 2021.11.0 scipy : 1.8.0 snappy : sqlalchemy : 1.4.31 tables : 3.7.0 tabulate : 0.8.9 xarray : 0.18.2 xlrd : 2.0.1 xlwt : 1.3.0 zstandard : None