Consider using IntegerArrays in to_pandas() for masked int columns
Currently, to_pandas automatically converts any masked column to float, even if it holds integers. This is a serious problem for integer identifiers (for example, GAIA source ids). If the integer is too large (beyond 2^53 in magnitude), the double-precision float can only approximate it in the last few digits, completely changing the identifier. Since this can happen with something as common as GAIA data, it's a big problem. I assume most people would call to_pandas and then convert the data type back to int, but this will now produce a different number.
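To make the rounding concrete, here is a minimal sketch of the int-to-float-to-int round trip. The identifier is a made-up 19-digit value chosen for illustration (an assumption, not a real catalogue entry); any integer that needs more than 53 bits behaves similarly.

```python
import numpy as np

# Hypothetical identifier for illustration only; not a real source id.
source_id = np.int64(2**62 + 1)

# What effectively happens when an integer column is cast to float
# and later cast back to int by the user.
as_float = np.float64(source_id)
round_trip = np.int64(as_float)

print(source_id)   # 4611686018427387905
print(round_trip)  # 4611686018427387904
```

The two printed values differ in the last digit, which is enough to point at a different object.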
More recent versions of pandas support nullable integers (https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html), using the new IntegerArray type (with data type pd.Int64Dtype()). With that now supported, it would make sense for astropy.table's to_pandas to turn masked int columns into this new data type rather than float. pandas IntegerArrays are apparently still “experimental”, but I think it's worth doing anyway, since turning integers into floats can have pretty severe side effects in astronomical data that can take a long time to debug.
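As a small illustration of the difference in pandas itself (not in astropy), the sketch below builds the same data twice: once with the default behaviour, where a missing value forces a float dtype, and once with the nullable Int64 dtype requested explicitly. The values are arbitrary.

```python
import pandas as pd

values = [1, None, 3]

# Default: the missing entry becomes NaN and the dtype becomes float64.
default_series = pd.Series(values)

# Nullable integer dtype: the missing entry becomes pd.NA and the
# remaining entries stay integers.
nullable_series = pd.Series(values, dtype="Int64")

print(default_series.dtype)
print(nullable_series.dtype)
```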
At the very least, if the above isn’t implemented yet, I think it would make sense for astropy to produce a warning when it’s converting integers to floats in this way, especially if the integers are large enough that the floats will only approximate them.
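A warning of this kind could look roughly like the sketch below. This is only an illustration: warn_if_lossy is a hypothetical helper, not an existing astropy function, and a plain numpy masked array stands in for a masked table column.

```python
import warnings

import numpy as np

# Integers with magnitude up to 2**53 are exactly representable in float64.
FLOAT64_EXACT_LIMIT = 2**53


def warn_if_lossy(name, column):
    """Hypothetical helper: warn if casting `column` to float would lose digits.

    Returns True if a warning was issued, False otherwise.
    """
    if column.dtype.kind not in ("i", "u"):
        return False
    # Only look at the unmasked entries.
    valid = np.ma.getdata(column)[~np.ma.getmaskarray(column)]
    if valid.size == 0:
        return False
    lossy = (int(valid.max()) > FLOAT64_EXACT_LIMIT
             or int(valid.min()) < -FLOAT64_EXACT_LIMIT)
    if lossy:
        warnings.warn(
            f"column '{name}' holds integers too large to be stored "
            "exactly as float64; values will be approximated"
        )
    return lossy


big = np.ma.MaskedArray(np.array([2**62 + 1, 5], dtype="int64"),
                        mask=[False, True])
small = np.ma.MaskedArray(np.array([10, 20], dtype="int64"),
                          mask=[False, True])
```

Calling warn_if_lossy("source_id", big) issues the warning, while the same call on small stays silent.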
Top GitHub Comments
Okay I was looking into it a bit, it seems the issue is here:
out[name] = column.astype(float).filled(np.nan)
at: https://github.com/astropy/astropy/blob/ad4e0f271bde866a2fe2ec6507fcffacf136e0b7/astropy/table/table.py#L3286
Right now, if it detects a MaskedColumn, it still converts to float even if column.dtype.kind in ('i', 'u') is True. Would it be as simple as changing the above to:
out[name] = pd.Series(column, dtype='Int64')
? I tested it on my own and that seems to work, so should I open a pull request? I haven't really done this before.
Follow the instructions here, and if you run into any trouble I can come by and help on Wednesday. Thanks for working on the fix!
EDIT: Updated link to point to latest instead of stable.
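The conversion discussed in the comments above can be sketched outside astropy as follows. This is a hedged illustration, not the change that was merged: masked_int_to_nullable is a hypothetical helper, a plain numpy masked array stands in for a MaskedColumn, and the sketch deliberately builds a pandas IntegerArray from the raw data and mask instead of using the one-liner from the comment, so that no intermediate float is involved.

```python
import numpy as np
import pandas as pd


def masked_int_to_nullable(column):
    """Hypothetical helper: masked integer array -> nullable pandas Series."""
    if column.dtype.kind not in ("i", "u"):
        raise TypeError("expected a signed or unsigned integer column")
    # Copy the underlying integers and the boolean mask unchanged;
    # masked entries become pd.NA in the resulting Series.
    data = np.asarray(np.ma.getdata(column)).copy()
    mask = np.ma.getmaskarray(column).copy()
    return pd.Series(pd.arrays.IntegerArray(data, mask))


# Made-up values for illustration; the first is too large for float64.
column = np.ma.MaskedArray(
    np.array([2**62 + 1, 7, 3], dtype="int64"),
    mask=[False, True, False],
)
series = masked_int_to_nullable(column)
print(series.dtype)
print(series[0])
```

The resulting dtype follows the input (int64 gives Int64), and the large identifier in the first row is preserved digit for digit.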