API: distinguish NA vs NaN in floating dtypes
Context: in the original pd.NA proposal (https://github.com/pandas-dev/pandas/issues/28095) the topic of pd.NA vs np.nan was raised several times, and it also came up in the recent pandas-dev mailing list discussion on pandas 2.0 (both in the context of np.nan for float and pd.NaT for datetime-like dtypes).
With the introduction of pd.NA, and if we want consistent “NA behaviour” across dtypes at some point in the future, I think there are two options for float dtypes:
- Keep using np.nan as we do now, but change its behaviour (e.g. in comparison ops) to match pd.NA
- Start using pd.NA in float dtypes
Personally, I think the first one is not really an option. Keeping it as np.nan but deviating from numpy’s behaviour feels like a non-starter to me, and it would also give a discrepancy between the vectorized behaviour in pandas containers and the scalar behaviour of np.nan.
For the second option, there are still multiple ways this could be implemented (a single array that still uses np.nan as the missing value sentinel but is presented to the user as pd.NA, versus a masked approach like we use for the nullable integers). But in this issue I would like to focus on the user-facing behaviour we want: Do we want to have both np.nan and pd.NA, or only allow pd.NA? Should np.nan still be considered “missing”, or should that be optional? What do we do on conversion from/to numpy? (The answer to some of those questions will also determine which of the two possible implementations is preferable.)
Actual discussion items: assume we are going to add floating dtypes that use pd.NA as the missing value indicator. Then the following question comes up:
If I have a Series[float64] could it contain both np.nan and pd.NA, and these signify different things?
So yes, it is technically possible to have both np.nan and pd.NA with different behaviour (np.nan as a “normal”, unmasked value in the actual data, pd.NA tracked in the mask). But we also need to decide whether we want this.
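To make the masked option concrete, here is a minimal sketch (plain numpy, with hypothetical names; not the actual nullable-array implementation) of how a values buffer plus a boolean mask can carry both kinds of "not a value" at the same time:

```python
import numpy as np
import pandas as pd

# Hypothetical masked layout: NaN lives in the values buffer as an ordinary
# float, while NA is tracked separately in a boolean mask.
values = np.array([np.nan, 1.0, 2.0])   # NaN e.g. produced by 0/0
mask = np.array([False, False, True])   # True marks a missing (NA) entry

def get_item(i):
    # Element access returns pd.NA for masked positions and the raw float
    # (possibly NaN) otherwise, so the two stay distinguishable.
    return pd.NA if mask[i] else values[i]

print(get_item(0))  # nan  -> "bad computational result"
print(get_item(2))  # <NA> -> "missing value"
```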
This was touched upon a bit in the original issue, but not really discussed further. Quoting a few things from the original thread in https://github.com/pandas-dev/pandas/issues/28095:
[@Dr-Irv in comment] I think it is important to distinguish between NA meaning “data missing” versus NaN meaning “not a number” / “bad computational result”.
vs
[@datapythonista in comment] I think NaN and NaT, when present, should be copied to the mask, and then we can forget about them (for what I understand values with True in the NA mask won’t be ever used).
So I think those two describe nicely the two options we have on the question: do we want both pd.NA and np.nan in a float dtype, and have them signify different things? -> 1) Yes, we can have both, versus 2) No, towards the user we only have pd.NA and “disallow” NaN (or interpret / convert any NaN on input to NA).
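For reference, the two scalars already behave differently today, and option 1 would surface that difference inside float columns as well (a small illustration of existing scalar behaviour, not new API):

```python
import numpy as np
import pandas as pd

# np.nan follows IEEE 754: comparisons simply return False.
print(np.nan == 1.0)   # False
print(np.nan < 1.0)    # False

# pd.NA propagates through comparisons ("unknown" semantics).
print(pd.NA == 1.0)    # <NA>
print(pd.NA < 1.0)     # <NA>
```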
A reason to have both is that they can signify different things (another reason is that most other data tools do this as well, I will put some comparisons in a separate post).
That reasoning was given by @Dr-Irv in https://github.com/pandas-dev/pandas/issues/28095#issuecomment-538786581: there are times when I get NaN as a result of a computation, which indicates that I did something numerically wrong, versus NaN meaning “missing data”. So should there be separate markers: one to mean “missing value” and the other to mean “bad computational result” (typically 0/0)?
A dummy example showing how both can occur:
```python
>>> pd.Series([0, 1, 2]) / pd.Series([0, 1, pd.NA])
0     NaN
1     1.0
2    <NA>
dtype: float64
```
The NaN is introduced by the computation, the NA is propagated from the input data (although note that in an arithmetic operation like this, NaN would also propagate).
So, yes, it is possible and potentially desirable to allow both pd.NA and np.nan in floating dtypes. But it also brings up several questions / complexities. Foremost, should NaN still be considered as missing? Meaning, should it be seen as missing in functions like isna / notna / dropna / fillna? Or should that be an option? Should NaN still be considered as missing (and thus skipped) in reducing operations that have a skipna keyword, like sum, mean, etc.?
Personally, I think we will need to keep NaN as missing, at least initially. But that will also introduce inconsistencies: although NaN would be seen as missing in the methods mentioned above, in arithmetic / comparison / scalar ops it would behave as NaN and not as NA (so e.g. a comparison gives False instead of propagating). It also means that in the missing-related methods we will need to check for NaN in the values as well as the mask (which can also have performance implications).
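A rough sketch of what that double check could look like (hypothetical helpers, just to show where the extra cost comes from under the masked approach):

```python
import numpy as np

def isna_keep_nan_as_missing(values, mask):
    # If NaN keeps counting as "missing", isna() has to scan the values
    # buffer in addition to reading the mask.
    return np.isnan(values) | mask

def isna_na_only(values, mask):
    # If only pd.NA counts as missing, the mask alone answers the question.
    return mask.copy()

values = np.array([np.nan, 1.0, 2.0])
mask = np.array([False, False, True])
print(isna_keep_nan_as_missing(values, mask))  # [ True False  True]
print(isna_na_only(values, mask))              # [False False  True]
```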
Some other various considerations:
- Having both pd.NA and NaN (np.nan) might actually be more confusing for users.
- If we want a consistent indicator and behavior for missing values across dtypes, I think we need a separate concept from NaN for float dtypes (i.e. pd.NA). Changing the behavior of NaN when inside a pandas container seems like a non-starter (the behavior of NaN is well defined in IEEE 754, and it would also deviate from the underlying numpy array).
- How do we handle compatibility with numpy? The solution we have come up with (for now) for the other nullable dtypes is to convert to object dtype by default, and to have an explicit to_numpy(.., na_value=np.nan) conversion. But given how np.nan is used in practice as a missing value indicator throughout the pydata ecosystem, this might be annoying. For conversion to numpy, see also some relevant discussion in https://github.com/pandas-dev/pandas/issues/30038.
- What about conversion / inference on input? E.g. when creating a Series from a float numpy array with NaNs (pd.Series(np.array([0.1, np.nan]))), do we convert NaNs to NA automatically by default? (See the sketch after this list for the last two points.)
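A small illustration of those two interop questions (the na_value keyword of to_numpy() is existing pandas API for the nullable dtypes; the rest just shows today's behaviour, not a proposal):

```python
import numpy as np
import pandas as pd

# Converting a nullable integer column to numpy: by default NA forces object
# dtype; an explicit na_value gives a plain float array with NaN instead.
s = pd.Series([1, 2, pd.NA], dtype="Int64")
print(s.to_numpy())                                  # [1 2 <NA>], dtype=object
print(s.to_numpy(dtype="float64", na_value=np.nan))  # [ 1.  2. nan]

# Input inference question: should NaNs in a float numpy array become NA?
arr = np.array([0.1, np.nan])
print(pd.Series(arr))  # today: NaN stays NaN in a plain float64 Series
```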
Top GitHub Comments
To get a sense of the room here, can I get a show of hands: does having np.nan/pd.NA be semantically distinct matter to you? Use reaction emojis to get a tally. Rocket emoji for “I don’t care”, Eyes emoji for “I don’t care, as long as my existing workflows aren’t affected.”
The SQL (at least PostgreSQL, SQLite) example is indeed not a very good one. For Postgres, although they support storing NaN, they don’t support the operations that can create them, and they also deviate from the standard in how they treat them (e.g. in ordering, see https://www.postgresql.org/docs/9.1/datatype-numeric.html). Now, there are also other (more modern?) SQL databases such as Clickhouse that seem to support NaN (they at least have both, and create NaN on 0/0, but I didn’t look in detail at how NaN is handled in, for example, reductions).
It’s certainly true that we shouldn’t look too much at others (certainly since for almost every option you can find an example somewhere), and should decide what is best for pandas. But I think it is still valuable information for making that decision.
One strong reason for me to have both NaN and NA is a practical one: the computational “engine” backing pandas and other dataframes will for the foreseeable future be mostly numpy and, increasingly, arrow, I think (pandas itself, and thus also dask, is now based on numpy; fletcher is experimenting with pandas based on arrow; cudf and vaex are based on arrow; modin has both a pandas and an arrow backend). And both of those computational engines have NaNs and produce NaNs from computations. So (I think) it will always be easiest to just go with whatever those computational engines return in unary/binary operations.
Note that in such a case we can still see NaN as missing when it comes to skipping (which is e.g. what R does) if that is preferred, which basically would delay the cost of checking for NaNs to those operations (instead of already needing to do the check on binary operations like division).
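In code terms that would mean something like the following inside a reduction (a hypothetical sketch of the masked approach, deferring the NaN check to the skipna path):

```python
import numpy as np
import pandas as pd

def masked_sum(values, mask, skipna=True):
    # Hypothetical reduction over a (values, mask) pair: NaNs produced by
    # earlier computations are only filtered out here, in the skipna path,
    # instead of being checked for (and masked) after every binary op.
    if skipna:
        valid = ~mask & ~np.isnan(values)
        return values[valid].sum()
    if mask.any():
        return pd.NA          # a genuine missing value propagates as NA
    return values.sum()       # any NaN in the values propagates as NaN

values = np.array([np.nan, 1.0, 2.0])
mask = np.array([False, False, True])
print(masked_sum(values, mask))                # 1.0
print(masked_sum(values, mask, skipna=False))  # <NA>
```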
I will try to make a summary of the discussion and arguments up to now.