
API: distinguish NA vs NaN in floating dtypes


Context: in the original pd.NA proposal (https://github.com/pandas-dev/pandas/issues/28095) the topic of pd.NA vs np.nan was raised several times. It also came up in the recent pandas-dev mailing list discussion on pandas 2.0 (both in the context of np.nan for float dtypes and pd.NaT for datetime-likes).

With the introduction of pd.NA, and if we want consistent “NA behaviour” across dtypes at some point in the future, I think there are two options for float dtypes:

  • Keep using np.nan as we do now, but change its behaviour (e.g. in comparison ops) to match pd.NA
  • Start using pd.NA in float dtypes

Personally, I think the first one is not really an option. Keeping it as np.nan, but deviating from numpy’s behaviour feels like a non-starter to me. And it would also give a discrepancy between the vectorized behaviour in pandas containers vs the scalar behaviour of np.nan.
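To make that scalar discrepancy concrete: under IEEE 754, NaN compares False against everything (even itself), while pd.NA propagates through comparisons. A quick illustration with today's scalars:

```python
import numpy as np
import pandas as pd

# IEEE 754: NaN compares False against everything, including itself
print(np.nan == np.nan)   # False
print(np.nan > 1.0)       # False

# pd.NA propagates: the result of a comparison with NA is again NA
print(pd.NA == 1)         # <NA>
print(pd.NA > 1)          # <NA>
```

Making np.nan inside a pandas container behave like the second half of this example, while the bare scalar keeps behaving like the first half, is exactly the inconsistency described above.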
For the second option, there are still multiple ways this could be implemented: a single array that still uses np.nan as the missing value sentinel but presents it to the user as pd.NA, versus a masked approach like we do for the nullable integers. But in this issue, I would like to focus on the user-facing behaviour we want: Do we want to have both np.nan and pd.NA, or only allow pd.NA? Should np.nan still be considered as “missing”, or should that be optional? What to do on conversion from/to numpy? (The answer to some of those questions will also determine which of the two possible implementations is preferable.)


Actual discussion items: assume we are going to add floating dtypes that use pd.NA as the missing value indicator. Then the following question comes up:

If I have a Series[float64], could it contain both np.nan and pd.NA, and could these signify different things?

So yes, it is technically possible to have both np.nan and pd.NA with different behaviour (np.nan as “normal”, unmasked value in the actual data, pd.NA tracked in the mask). But we also need to decide if we want this.
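The “technically possible” part can be sketched with plain numpy (a hypothetical minimal representation, not pandas' actual code): NaN is an ordinary value in the data, while NA lives only in the mask, so the two remain distinguishable.

```python
import numpy as np

# Hypothetical masked representation of [NaN, 1.0, NA]:
# NaN sits in the values as an ordinary float; NA is tracked in the mask.
values = np.array([np.nan, 1.0, 0.0])
mask = np.array([False, False, True])  # True = missing (NA)

# Position 0 is NaN-but-present; position 2 is NA regardless of
# what happens to sit in its values slot.
is_nan = np.isnan(values) & ~mask
is_na = mask
print(is_nan)  # NaN only at position 0
print(is_na)   # NA only at position 2
```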

This was touched upon a bit in the original issue, but not really discussed further. Quoting a few things from the original thread in https://github.com/pandas-dev/pandas/issues/28095:

[@Dr-Irv in comment] I think it is important to distinguish between NA meaning “data missing” versus NaN meaning “not a number” / “bad computational result”.

vs

[@datapythonista in comment] I think NaN and NaT, when present, should be copied to the mask, and then we can forget about them (for what I understand values with True in the NA mask won’t be ever used).

So I think those two comments describe nicely the two options we have on the question “do we want both pd.NA and np.nan in a float dtype, signifying different things?” -> 1) Yes, we can have both; versus 2) No, towards the user we only have pd.NA and “disallow” NaN (or interpret / convert any NaN on input to NA).

A reason to have both is that they can signify different things (another reason is that most other data tools do this as well, I will put some comparisons in a separate post). That reasoning was given by @Dr-Irv in https://github.com/pandas-dev/pandas/issues/28095#issuecomment-538786581: there are times when I get NaN as a result of a computation, which indicates that I did something numerically wrong, versus NaN meaning “missing data”. So should there be separate markers - one to mean “missing value” and the other to mean “bad computational result” (typically 0/0) ?

A dummy example showing how both can occur:

>>> pd.Series([0, 1, 2]) / pd.Series([0, 1, pd.NA])
0    NaN
1    1.0
2   <NA>
dtype: float64

The NaN is introduced by the computation, the NA is propagated from the input data (although note that in an arithmetic operation like this, NaN would also propagate).

So, yes, it is possible and potentially desirable to allow both pd.NA and np.nan in floating dtypes. But it also brings up several questions / complexities. Foremost, should NaN still be considered as missing? Meaning, should it be seen as missing in functions like isna/notna/dropna/fillna? Or should that be an option? Should NaN still be considered as missing (and thus skipped) in reducing operations (those that have a skipna keyword, like sum, mean, etc.)?

Personally, I think we will need to keep treating NaN as missing, at least initially. But that will also introduce inconsistencies: although NaN would be seen as missing in the methods mentioned above, in arithmetic / comparison / scalar ops it would behave as NaN and not as NA (so e.g. a comparison gives False instead of propagating). It also means that in the missing-related methods we will need to check for both NaN in the values and NA in the mask (which can also have performance implications).
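The double check mentioned above could look like this in a masked implementation (a sketch with a hypothetical isna helper, not pandas' actual code); the extra np.isnan pass over the values is where the performance cost comes from:

```python
import numpy as np

# Sketch: if NaN keeps counting as missing, isna on a masked float
# array must scan the values *and* the mask (hypothetical helper).
def isna(values: np.ndarray, mask: np.ndarray) -> np.ndarray:
    return np.isnan(values) | mask   # extra O(n) isnan pass per call

values = np.array([np.nan, 1.0, 0.0])
mask = np.array([False, False, True])
result = isna(values, mask)
print(result)  # both the NaN and the masked NA report as missing
```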


Some other various considerations:

  • Having both pd.NA and NaN (np.nan) might actually be more confusing for users.

  • If we want a consistent indicator and behavior for missing values across dtypes, I think we need a separate concept from NaN for float dtypes (i.e. pd.NA). Changing the behavior of NaN when inside a pandas container seems like a non-starter (the behavior of NaN is well defined in IEEE 754, and it would also deviate from the underlying numpy array)

  • How do we handle compatibility with numpy? The solution we have come up with (for now) for the other nullable dtypes is to convert to object dtype by default, and to offer an explicit to_numpy(.., na_value=np.nan) conversion. But given how np.nan is in practice used throughout the pydata ecosystem as a missing value indicator, this might be annoying.

    For conversion to numpy, see also some relevant discussion in https://github.com/pandas-dev/pandas/issues/30038

  • What about conversion / inference on input? E.g. when creating a Series from a float numpy array that contains NaNs (pd.Series(np.array([0.1, np.nan]))), do we convert NaNs to NA automatically by default?
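Both conversion directions can be illustrated with the machinery pandas already has for nullable dtypes. Note that the nullable Float64 dtype shown here was only added in later pandas releases, and its default of converting NaN to NA on input is precisely the design choice under discussion:

```python
import numpy as np
import pandas as pd

# To numpy: the pattern used for the nullable integers, where an
# explicit na_value picks the numpy representation of missing values.
s = pd.Series([1, None], dtype="Int64")
arr = s.to_numpy(dtype="float64", na_value=np.nan)
print(arr)  # [ 1. nan]

# From numpy: with the nullable Float64 dtype, NaN in the input is
# converted to pd.NA by default; one possible answer to the question.
f = pd.array(np.array([0.1, np.nan]), dtype="Float64")
print(f[1])  # <NA>
```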

cc @pandas-dev/pandas-core @Dr-Irv @dsaxton

Issue Analytics

  • State: open
  • Created: 4 years ago
  • Reactions: 1
  • Comments: 75 (71 by maintainers)

Top GitHub Comments

13 reactions
jbrockmendel commented, Jan 2, 2022

I’m probably the one who has argued for differentiating between np.nan and pd.NA as “error in computation” versus “missing data”. I’ve found this to be important in applications, where this differentiation helped find issues with underlying data or in a computation.

To get a sense of the room here, can I get a show of hands: does keeping np.nan/pd.NA semantically distinct matter to you? Use reaction emojis to get a tally: Rocket emoji for “I don’t care”, Eyes emoji for “I don’t care, as long as my existing workflows aren’t affected.”

3 reactions
jorisvandenbossche commented, Mar 11, 2020

I’m not sure if consistency with other tools is much of a concern when many of these don’t seem consistent with each other;

The SQL (at least PostgreSQL, SQLite) example is indeed not a very good one. Postgres, although it supports storing NaN, doesn’t support the operations that can create them, and it also deviates from the standard in how it treats them (e.g. in ordering, see https://www.postgresql.org/docs/9.1/datatype-numeric.html). There are also other (more modern?) SQL databases, such as ClickHouse, that seem to support NaN (at least they have both, and create NaN on 0/0, but I didn’t look in detail at how they are handled in, for example, reductions).

It’s certainly true that we shouldn’t look too much at others (certainly as for almost every option you can find an example) and should decide what is best for pandas. But I think it is still valuable information for making that decision.


One strong reason for me to have both NaN and NA is a practical one: the computational “engine” backing pandas and other dataframe libraries will for the foreseeable future be mostly numpy, and maybe increasingly arrow, I think (pandas itself, and thus also dask, is now based on numpy; fletcher is experimenting with pandas based on arrow; cudf and vaex are based on arrow; modin has both a pandas and an arrow backend). Both those computational engines have NaNs and produce NaNs from computations. So (I think) it will always be easiest to just go with whatever those engines return in unary/binary operations.

Note that in such a case we can still treat NaN as missing when it comes to skipping (which is e.g. what R does) if that is preferred; that basically delays the cost of checking for NaNs to those operations (instead of already needing to do the check on binary operations like division).
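For example, numpy itself happily produces NaN (and inf) from arithmetic, so a wrapping library has to deal with those values one way or another:

```python
import numpy as np

# numpy, like other computational engines, produces NaN from
# operations; here 0/0 gives NaN and 1/0 gives inf.
with np.errstate(invalid="ignore", divide="ignore"):
    out = np.array([0.0, 1.0]) / np.array([0.0, 0.0])
print(out)  # [nan inf]
```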


I will try to make a summary of the discussion and arguments up to now.
