Pandas casting int64 to float64, misrepresenting value
I have the following data being returned by Presto (single column, 6 rows):
[(None,), (1239162456494753670,), (None,), (None,), (None,), (None,)]
Due to the missing data (None), Pandas infers the type as float64, converting the value to a wrong id:
>>> import pandas as pd
>>> column_names = ['organization_lyft_id']
>>> data = [(None,), (1239162456494753670,), (None,), (None,), (None,), (None,)]
>>> df = pd.DataFrame(list(data), columns=column_names).infer_objects()  # SupersetDataFrame
>>> print(df)
   organization_lyft_id
0                   NaN
1          1.239162e+18
2                   NaN
3                   NaN
4                   NaN
5                   NaN
>>> print(df.dtypes)
organization_lyft_id    float64
dtype: object
The number then shows up as 1239162456494753800 in SQL Lab.
Here’s the Pandas documentation on this:
… pandas primarily uses NaN to represent missing data. Because NaN is a float, this forces an array of integers with any missing values to become floating point. In some cases, this may not matter much. But if your integer column is, say, an identifier, casting to float can be problematic. Some integers cannot even be represented as floating point numbers. (emphasis mine)
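To make the quoted point concrete for this particular ID, here is a small sketch in plain Python (no pandas involved). It relies only on the fact that float64 has a 53-bit significand; the variable names are illustrative and not taken from Superset.

```python
# The ID from the Presto result above.
value = 1239162456494753670

as_float = float(value)      # what a float64 column ends up storing
round_trip = int(as_float)   # exact integer value of that float

# The ID lies between 2**60 and 2**61. With a 53-bit significand,
# consecutive float64 values in that range are 2**(60 - 52) = 256 apart,
# so only multiples of 256 can be represented exactly.
spacing = 2 ** (60 - 52)

print(2 ** 60 <= value < 2 ** 61)               # True
print(round_trip == value)                      # False: the ID is not a multiple of 256
print(round_trip % spacing)                     # 0
print(abs(round_trip - value) <= spacing // 2)  # True: rounded to the nearest float
```

Any 64-bit identifier above 2**53 is exposed to this kind of rounding as soon as the column becomes float64.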
Note that if the missing data is filtered out, the type is inferred as int64 and the value shows up correctly in SQL Lab:
>>> column_names = ['organization_lyft_id']
>>> data = [(1239162456494753670,)]
>>> df = pd.DataFrame(list(data), columns=column_names).infer_objects()  # SupersetDataFrame
>>> print(df)
   organization_lyft_id
0   1239162456494753670
>>> print(df.dtypes)
organization_lyft_id    int64
dtype: object
The solution is to pass a dtype argument when creating the Pandas data frame, built from the cursor description. I’m working on a fix for this.
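A minimal sketch of that idea follows. It is not the fix that went into Superset. It assumes a pandas version that provides the nullable "Int64" extension dtype (0.24 or later), and both `description` (a stand-in for a DB-API `cursor.description`, reduced to name and type pairs) and `TYPE_TO_DTYPE` are hypothetical names introduced for illustration only.

```python
import pandas as pd

column_names = ["organization_lyft_id"]
data = [(None,), (1239162456494753670,), (None,), (None,), (None,), (None,)]

# Hypothetical stand-in for cursor.description: (name, type_code) pairs.
description = [("organization_lyft_id", "bigint")]

# Hypothetical mapping from database type names to pandas dtypes.
# "Int64" (capital I) is pandas' nullable integer dtype: missing values
# are tracked in a separate mask, so the column is never cast to float.
TYPE_TO_DTYPE = {"bigint": "Int64"}

columns = {}
for position, (name, type_code) in enumerate(description):
    values = [row[position] for row in data]
    dtype = TYPE_TO_DTYPE.get(type_code, object)  # unknown types stay as objects
    columns[name] = pd.array(values, dtype=dtype)

df = pd.DataFrame(columns, columns=column_names)

print(df.dtypes)
print(int(df["organization_lyft_id"].iloc[1]) == 1239162456494753670)
```

The DataFrame constructor's own dtype argument applies a single dtype to every column, which is why this sketch builds each column separately. Passing dtype=object to the constructor would be a cruder alternative that also keeps the Python integers intact.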
Issue Analytics
- Created 4 years ago
- Comments:6 (5 by maintainers)
Top GitHub Comments
Issue-Label Bot is automatically applying the label #bug to this issue, with a confidence of 0.57. Please mark this comment with 👍 or 👎 to give our bot feedback!
We noticed an issue with the NumPy reshaping logic in SQL Lab. Here labels is an ARRAY<STRING> and renders correctly if multiple columns are selected, but it is incorrectly reshaped if it’s the only column.