Consider using IntegerArrays in to_pandas() for masked int columns


Currently, to_pandas automatically converts every masked column to float, even when the column holds integers. This is a serious problem for integer identifiers (for example, GAIA source ids). A double-precision float has a 53-bit significand, so an integer larger than 2^53 may be approximated in its last few digits, which silently changes the identifier. Because this can happen with something as common as GAIA data, it’s a big problem. I assume most people would call to_pandas and then cast the column back to int, but that cast now produces a different number.
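The precision loss described above can be reproduced without astropy or pandas. This is a minimal sketch in plain Python, whose float is the same IEEE double; the identifier is a made-up value chosen only because it exceeds 2**53, not a real GAIA source id.

```python
# Hypothetical identifier, chosen only because it is larger than 2**53.
source_id = 2**62 + 1

as_float = float(source_id)   # what a float64 column would store
round_trip = int(as_float)    # casting the float column back to int

print(source_id)                # 4611686018427387905
print(round_trip)               # 4611686018427387904
print(source_id == round_trip)  # False
```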

Recent versions of pandas support nullable integers (https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html) through the new IntegerArray type (for example, with data type pd.Int64Dtype()). With that now supported, it would make sense for astropy.table.Table.to_pandas to turn masked int columns into this data type rather than float. pandas IntegerArrays are apparently still “experimental”, but I think it’s worth doing anyway, since turning integers into floats can have pretty severe side effects in astronomical data, and those can take a long time to debug.
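As a rough illustration of the nullable integer type mentioned above, the sketch below builds a pandas column from an integer array plus a boolean mask using pd.arrays.IntegerArray. The values and mask are invented for the example.

```python
import numpy as np
import pandas as pd

# Invented data: one large identifier, one masked slot, one small value.
values = np.array([2**62 + 1, 0, 7], dtype=np.int64)
mask = np.array([False, True, False])  # True marks a missing entry

# IntegerArray stores the integers and the mask side by side,
# so nothing is routed through float64.
ids = pd.Series(pd.arrays.IntegerArray(values, mask))

print(ids.dtype)             # Int64
print(int(ids[0]))           # 4611686018427387905
print(ids.isna().tolist())   # [False, True, False]
```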

At the very least, until the above is implemented, I think it would make sense for astropy to produce a warning when it converts integers to floats in this way, especially if the integers are large enough that the floats will only approximate them.
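One way such a warning could look is sketched below. The helper name warn_if_lossy and the message text are hypothetical and are not part of astropy; the check only relies on the fact that every integer with magnitude up to 2**53 is exactly representable as a float64.

```python
import warnings

import numpy as np

# Every integer with magnitude <= 2**53 fits exactly in a float64.
FLOAT64_EXACT_INT_LIMIT = 2**53


def warn_if_lossy(values, name="column"):
    """Hypothetical helper: warn if an int -> float64 cast may change values."""
    values = np.asarray(values)
    if values.dtype.kind not in ("i", "u") or values.size == 0:
        return False
    # Compare as Python ints to avoid mixed signed/unsigned comparisons.
    lossy = (int(values.max()) > FLOAT64_EXACT_INT_LIMIT
             or int(values.min()) < -FLOAT64_EXACT_INT_LIMIT)
    if lossy:
        warnings.warn(
            f"{name}: integers beyond 2**53 may be altered by conversion to float",
            UserWarning,
        )
    return lossy


warn_if_lossy(np.array([2**62 + 1], dtype=np.int64), name="source_id")
```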

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

2 reactions
mar-ses commented, Nov 4, 2019

Okay, I looked into it a bit, and the issue seems to be this line: out[name] = column.astype(float).filled(np.nan), at:

https://github.com/astropy/astropy/blob/ad4e0f271bde866a2fe2ec6507fcffacf136e0b7/astropy/table/table.py#L3286

Right now, if it detects a MaskedColumn, it converts to float even when column.dtype.kind in ('i', 'u') is True.

Would it be as simple as changing the above to something like out[name] = pd.Series(column, dtype='Int64')? I tested it on my own and that seems to work, so should I open a pull request? I haven’t really done this before.
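The dispatch on column.dtype.kind discussed above could be sketched as follows. This is not astropy code: masked_to_series is a hypothetical helper, a NumPy masked array stands in for a MaskedColumn, and instead of the pd.Series(column, dtype='Int64') call proposed in the comment it builds the column from the data and mask explicitly, so that no intermediate float conversion can occur.

```python
import numpy as np
import pandas as pd


def masked_to_series(column):
    """Hypothetical helper: convert a NumPy masked array to a pandas Series."""
    data = np.ma.getdata(column)
    mask = np.ma.getmaskarray(column)  # always a full boolean array
    if data.dtype.kind in ("i", "u"):
        # Integer columns: keep exact values, record missing entries in the mask.
        return pd.Series(pd.arrays.IntegerArray(data.copy(), mask.copy()))
    # Other columns: keep the existing float + NaN behaviour.
    return pd.Series(np.ma.filled(column.astype(float), np.nan))


# Stand-in for a masked integer column with one missing entry.
col = np.ma.MaskedArray(
    np.array([2**62 + 1, 5, 9], dtype=np.int64),
    mask=[False, True, False],
)
series = masked_to_series(col)
print(series.dtype)     # Int64
print(int(series[0]))   # 4611686018427387905
```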

1 reaction
bmorris3 commented, Nov 4, 2019

Follow the instructions here, and if you run into any trouble I can come by and help on Wednesday. Thanks for working on the fix!

EDIT: Updated link to point to latest instead of stable.


