Pandas casting int64 to float64, misrepresenting value

I have the following data being returned by Presto (single column, 6 rows):

[(None,), (1239162456494753670,), (None,), (None,), (None,), (None,)]

Due to the missing data (None), Pandas infers the type as float64, converting the value to a wrong id:

>>> import pandas as pd
>>> column_names = ['organization_lyft_id']
>>> data = [(None,), (1239162456494753670,), (None,), (None,), (None,), (None,)]
>>> df = pd.DataFrame(list(data), columns=column_names).infer_objects()  # SupersetDataFrame
>>> print(df)
   organization_lyft_id
0                   NaN
1          1.239162e+18
2                   NaN
3                   NaN
4                   NaN
5                   NaN
>>> print(df.dtypes)
organization_lyft_id    float64
dtype: object

The number then shows up as 1239162456494753800 in SQL Lab.

Here’s the Pandas documentation on this:

… pandas primarily uses NaN to represent missing data. Because NaN is a float, this forces an array of integers with any missing values to become floating point. In some cases, this may not matter much. But if your integer column is, say, an identifier, casting to float can be problematic. Some integers cannot even be represented as floating point numbers. (emphasis mine)
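To make the precision loss concrete, here is a minimal sketch in plain Python (no pandas needed; the variable names are illustrative). A float64 carries a 53-bit significand, so integers between 2^60 and 2^61 can only land on multiples of 256, and the id from this issue is not one of them.

```python
original_id = 1239162456494753670

# Round-trip through a float, which is what happens when the column becomes float64.
as_float = float(original_id)
round_tripped = int(as_float)

print(round_tripped == original_id)      # False: the id was rounded
print(round_tripped % 256)               # 0: floats this large are spaced 256 apart
print(abs(round_tripped - original_id))  # size of the rounding error
```

SQL Lab presumably displays the shortest decimal string that maps back to the same float, which would explain why the value appears as 1239162456494753800.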

Note that if the missing data is filtered the value is inferred as an int64, and it shows up correctly in SQL Lab:

>>> column_names = ['organization_lyft_id']
>>> data = [(1239162456494753670,)]
>>> df = pd.DataFrame(list(data), columns=column_names).infer_objects()  # SupersetDataFrame
>>> print(df)
   organization_lyft_id
0   1239162456494753670
>>> print(df.dtypes)
organization_lyft_id    int64
dtype: object

The solution is to pass a dtype argument when creating the Pandas data frame, built from the cursor description. I’m working on a fix for this.
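As a rough sketch of that idea (not the actual fix that went into Superset), one option is pandas' nullable `Int64` extension dtype, which tracks missing values in a separate mask instead of using NaN. The assumption here is that the cursor description has already identified the column as a 64-bit integer; the `values` list is illustrative.

```python
import pandas as pd

column_names = ["organization_lyft_id"]
data = [(None,), (1239162456494753670,), (None,), (None,), (None,), (None,)]

# Assumption: the cursor description says this column holds 64-bit integers,
# so we ask for the nullable "Int64" dtype instead of letting pandas infer float64.
values = [row[0] for row in data]
df = pd.DataFrame({column_names[0]: pd.array(values, dtype="Int64")})

print(df.dtypes)
print(df)
```

With this dtype the missing rows show up as `<NA>` and the id keeps all of its digits, at the cost of any downstream code having to handle an extension array rather than a plain NumPy one.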

Top GitHub Comments

issue-label-bot[bot] commented, Sep 14, 2019

Issue-Label Bot is automatically applying the label #bug to this issue, with a confidence of 0.57. Please mark this comment with 👍 or 👎 to give our bot feedback!

Links: app homepage, dashboard and code for this bot.

john-bodley commented, Dec 3, 2019

We noticed an issue with the NumPy reshaping logic in SQL Lab. Here labels is an ARRAY<STRING> and renders correctly if multiple columns are selected, but it is incorrectly reshaped if it’s the only column.

[Screenshots: Screen Shot 2019-12-03 at 9 46 23 AM; Screen Shot 2019-12-03 at 9 45 49 AM]
