DOC: Breaking change for join, int to float type coercion


Pandas version checks

  • I have checked that the issue still exists on the latest versions of the docs on main here

Location of the documentation

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html

Documentation problem

The behaviour of join appears to have changed somewhere between pandas versions 1.2.4 and 1.4.2.

If I perform an inner join of a float column with an int column, then in 1.2.4 the result is an int64 column. In 1.4.2 it’s a float64 column. The test below passes in 1.2.4 but fails in 1.4.2.

I couldn’t find any documentation of this behaviour. Although the new behaviour seems reasonable, it caused me a bit of pain to get to the bottom of, and it is a breaking change. Is this behaviour documented anywhere and, if not, could it be?

import pandas as pd
from pandas.testing import assert_frame_equal


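def test_join_type_coercion():
    # The None in the last row makes pandas infer float64 for the left "id".
    left = [
        {"id": 1, "name": "Chase"},
        {"id": 2, "name": "Sky"},
        {"id": 3, "name": "Marshall"},
        {"id": None, "name": "Daring Danny X"},
    ]
    # The right "id" has no missing values, so it is inferred as int64.
    right = [
        {"id": 1, "color": "Blue"},
        {"id": 2, "color": "Pink"},
        {"id": 3, "color": "Red"},
    ]

    left_df = pd.DataFrame(left)
    right_df = pd.DataFrame(right)

    # Inner join on the "id" index; the row with the missing id has no match.
    joined = (
        left_df.set_index("id")
        .join(right_df.set_index("id"), how="inner")
        .reset_index()
    )
    # The expected "id" column is int64, so a float64 result fails the dtype check.
    expected = pd.DataFrame([
        {"id": 1, "name": "Chase", "color": "Blue"},
        {"id": 2, "name": "Sky", "color": "Pink"},
        {"id": 3, "name": "Marshall", "color": "Red"},
    ])
    assert_frame_equal(expected, joined)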

Suggested fix for documentation

I’m not too sure what the pattern is here, but maybe add a note describing the change in behaviour and the version in which it was introduced.

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
guyrt commented, Jun 20, 2022

Sorry for misreading that. This change happened between 1.2.5 and 1.3.0, and nothing in the 1.3.0 changelog would obviously have caused it. So that’s the right place to include a note on the breaking change. I’m happy to put together the right message and submit a PR (after doing a little investigation first to understand when the change happens; e.g. I know it only occurs when you use indices in the underlying merge).
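One way to investigate the index-versus-column distinction mentioned above is to run the same inner join both ways and print the resulting key dtype alongside the installed pandas version. This is only a sketch for comparing versions locally: the frames are made up for illustration, and no particular output is claimed here, since the result is exactly what is reported to vary between releases.

```python
import pandas as pd

# Illustrative frames: float64 keys on the left, int64 keys on the right.
left = pd.DataFrame({"id": [1.0, 2.0, 3.0], "name": ["Chase", "Sky", "Marshall"]})
right = pd.DataFrame({"id": [1, 2, 3], "color": ["Blue", "Pink", "Red"]})

# Path 1: join on the index, as in the reported test.
via_index = (
    left.set_index("id")
    .join(right.set_index("id"), how="inner")
    .reset_index()
)

# Path 2: merge on the key column without touching the index.
via_columns = left.merge(right, on="id", how="inner")

# Print the observed dtypes so runs under different versions can be compared.
print("pandas version:", pd.__version__)
print("index join key dtype:", via_index["id"].dtype)
print("column merge key dtype:", via_columns["id"].dtype)
```

Running this under 1.2.5 and 1.3.0 in separate environments should show whether only the index-based path changed.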

1 reaction
oxlade39 commented, Jun 19, 2022

The core problem is that numpy’s int datatypes do not allow NaN, but float does, so the dtype is coalesced into a numpy float64. This is covered explicitly in https://pandas.pydata.org/docs/user_guide/gotchas.html#support-for-integer-na. See in particular the “NA Type promotions” section in the second link.

I’ll defer to those with more experience with pandas and its docs, but this seems inappropriate to call out explicitly in the join documentation given it’s well covered elsewhere. I can see the argument for including it in the changelog, given that the behaviour changed. Are those ever edited post hoc?
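The NaN promotion described here can be seen in a minimal sketch, independent of join. The series below are illustrative only; the nullable "Int64" extension dtype is shown as one way to keep integers alongside a missing value, not as something proposed in this thread.

```python
import pandas as pd

# Integers only: pandas infers int64.
ints = pd.Series([1, 2, 3])

# A missing value cannot be stored in a numpy int array, so pandas
# promotes the whole series to float64 and stores NaN.
with_missing = pd.Series([1, 2, None])

# The nullable extension dtype keeps integers and stores <NA> instead.
nullable = pd.Series([1, 2, None], dtype="Int64")

print(ints.dtype)          # int64
print(with_missing.dtype)  # float64
print(nullable.dtype)      # Int64
```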

How has this changed in pandas 1.2.4 vs 1.4.2? My point is really that the behaviour of join has changed in a breaking way. This (the change in behaviour) should surely be documented somewhere?

To hopefully add clarity: although the None does cause the first DataFrame to coalesce the id to a float, that’s unrelated to what I’m trying to demonstrate here. I could (and perhaps should) have typed out the ids as floats.

Either way, a join on a float with an int used to result in an int; now it doesn’t.
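For code that needs the same key dtype whichever pandas version is installed, one option is to cast the key explicitly after the join rather than rely on the dtype that join infers. The sketch below types the ids out as floats, as suggested above; the explicit cast is an assumption about what a caller might want, not a fix proposed by the maintainers.

```python
import pandas as pd

# Float keys on the left, int keys on the right (illustrative data).
left = pd.DataFrame({"id": [1.0, 2.0, 3.0], "name": ["Chase", "Sky", "Marshall"]})
right = pd.DataFrame({"id": [1, 2, 3], "color": ["Blue", "Pink", "Red"]})

joined = (
    left.set_index("id")
    .join(right.set_index("id"), how="inner")
    .reset_index()
)

# Pin the key dtype explicitly instead of depending on join's inference.
# This is safe here because every surviving id is a whole number.
joined["id"] = joined["id"].astype("int64")

print(joined.dtypes)
```

The cast would raise if a NaN key survived the join, so it only suits joins where missing keys are dropped, as in this inner join.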
