BUG: Merge fails on nullable integer columns when dtypes are unequal

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import numpy as np
import pandas as pd

null_keys = pd.DataFrame(
    {
        "key": pd.Series([0, 1, pd.NA], dtype=pd.Int64Dtype()),
        "val1": pd.Series(["a", "b", "c"]),
    }
)
non_null_keys = pd.DataFrame(
    {
        "key": pd.Series([0, 1, 2], dtype=np.int64),  # This fails!
        # "key": pd.Series([0, 1, 2], dtype=pd.UInt64Dtype()),  # This also fails!
        # "key": pd.Series([0, 1, 2], dtype=pd.Int64Dtype()),  # This works!
        "val2": pd.Series(["d", "e", "f"]),
    }
)
print(pd.merge(null_keys, non_null_keys, how="left", on="key").to_markdown())

Issue Description

If you try to merge two (or more, probably) objects, and all of the following are true:

  • col1 is a nullable integer type (e.g. pd.Int64Dtype())
  • col1 contains nulls
  • col2 is an integer type (either nullable, e.g. pd.Int64Dtype(), or non-nullable, e.g. np.int64)
  • col1.dtype != col2.dtype

then the merge fails with:

Traceback (most recent call last):
  File "test.py", line 20, in <module>
    pd.merge(null_keys, non_null_keys, on="key")
  File "/workspaces/pandas/pandas/core/reshape/merge.py", line 123, in merge
    return op.get_result()
  File "/workspaces/pandas/pandas/core/reshape/merge.py", line 717, in get_result
    join_index, left_indexer, right_indexer = self._get_join_info()
  File "/workspaces/pandas/pandas/core/reshape/merge.py", line 968, in _get_join_info
    (left_indexer, right_indexer) = self._get_join_indexers()
  File "/workspaces/pandas/pandas/core/reshape/merge.py", line 942, in _get_join_indexers
    return get_join_indexers(
  File "/workspaces/pandas/pandas/core/reshape/merge.py", line 1497, in get_join_indexers
    zipped = zip(*mapped)
  File "/workspaces/pandas/pandas/core/reshape/merge.py", line 1494, in <genexpr>
    _factorize_keys(left_keys[n], right_keys[n], sort=sort, how=how)
  File "/workspaces/pandas/pandas/core/reshape/merge.py", line 2227, in _factorize_keys
    lk = ensure_int64(np.asarray(lk))
  File "pandas/_libs/algos_common_helper.pxi", line 81, in pandas._libs.algos.ensure_int64
    return arr.astype(np.int64, copy=copy)
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NAType'

However, if I change the np.int64 dtype to pd.Int64Dtype() as well (i.e. both dtypes are equal), then the merge works fine. (Note that changing both columns to non-nullable types isn’t an option, because some of the keys are indeed null.) The merge also fails if one column is Int64 and the other is UInt64. Tested on main.
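The observation above suggests a workaround: cast the non-nullable key column to the nullable dtype of the other frame before merging. The following is a minimal sketch of that idea, not a fix for the underlying bug; the name `aligned` is only illustrative, and the frames mirror the reproducible example.

```python
import numpy as np
import pandas as pd

null_keys = pd.DataFrame(
    {
        "key": pd.Series([0, 1, pd.NA], dtype=pd.Int64Dtype()),
        "val1": pd.Series(["a", "b", "c"]),
    }
)
non_null_keys = pd.DataFrame(
    {
        "key": pd.Series([0, 1, 2], dtype=np.int64),
        "val2": pd.Series(["d", "e", "f"]),
    }
)

# Workaround sketch: give both key columns the same nullable dtype
# before merging, so the merge never sees mismatched integer dtypes.
aligned = non_null_keys.astype({"key": null_keys["key"].dtype})

merged = pd.merge(null_keys, aligned, how="left", on="key")
print(merged)
```

Casting int64 to Int64 is lossless. For the Int64 versus UInt64 case, a cast in either direction may fail or lose information for values outside the target range, so check the key range first.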

The offending code is probably in pandas/core/reshape/merge.py, line 2227, in _factorize_keys, but I didn’t dig any deeper than that.
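The last two frames of the traceback can be reproduced in isolation. This sketch assumes the failure comes from the conversion shown there (`np.asarray` followed by a cast to `np.int64`); it calls `astype` directly rather than the internal `ensure_int64` helper, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# A nullable integer array that contains a missing value.
nullable = pd.array([0, 1, pd.NA], dtype="Int64")

# Converting to NumPy yields an object array that still holds pd.NA,
# because a plain int64 array has no way to represent a missing value.
as_object = np.asarray(nullable)
print(as_object.dtype)

# Casting that object array to int64 calls int() on pd.NA, which raises.
try:
    as_object.astype(np.int64)
    cast_failed = False
except TypeError as exc:
    cast_failed = True
    print(type(exc).__name__, exc)
```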

Expected Behavior

The merge should act the same as when both keys are nullable and dtypes are equal (e.g. both Int64Dtype):

|    | key | val1 | val2 |
|---:|----:|:-----|:-----|
|  0 |   0 | a    | d    |
|  1 |   1 | b    | e    |
|  2 | nan | c    | nan  |

Installed Versions

INSTALLED VERSIONS

commit : ec7543622a415301849bfa2f64243573c9fe5bee
python : 3.8.12.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.47-linuxkit
Version : #1 SMP Sat Jul 3 21:51:47 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : None.None

pandas : 1.5.0.dev0+688.gec7543622a
numpy : 1.22.2
pytz : 2021.3
dateutil : 2.8.2
pip : 21.3.1
setuptools : 60.9.3
Cython : 0.29.28
pytest : 7.0.1
hypothesis : 6.39.1
sphinx : 4.3.2
blosc : None
feather : None
xlsxwriter : 3.0.3
lxml.etree : 4.8.0
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : 8.1.1
pandas_datareader: None
bs4 : 4.10.0
bottleneck : 1.3.4
brotli :
fastparquet : 0.8.0
fsspec : 2021.11.0
gcsfs : 2021.11.0
markupsafe : 2.1.0
matplotlib : 3.5.1
numba : 0.53.1
numexpr : 2.8.0
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : 7.0.0
pyreadstat : 1.1.4
pyxlsb : None
s3fs : 2021.11.0
scipy : 1.8.0
snappy :
sqlalchemy : 1.4.31
tables : 3.7.0
tabulate : 0.8.9
xarray : 0.18.2
xlrd : 2.0.1
xlwt : 1.3.0
zstandard : None

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 6 (5 by maintainers)

Top GitHub Comments

NickCrews commented, Apr 22, 2022

@phofl I don’t have permissions to add the duplicate label, IDK if you want to. Thanks!

phofl commented, Apr 22, 2022

@NickCrews this looks like a duplicate of #46178?
