Specify NaN behaviour in `unique()`

Say we have an array with multiple NaN values:

>>> x = xp.full(5, xp.nan)
>>> x
Array([nan, nan, nan, nan, nan])

Should NaNs be treated as unique to one another?

>>> xp.unique(x)
Array([nan, nan, nan, nan, nan])

Or should they be treated as the same?

>>> xp.unique(x)
Array([nan])

I have no dog in this race, but you might want to refer to the discussion at the bottom of numpy/numpy#19301 and on the NumPy mailing list, which relates to a recent accidental “regression” in how np.unique() deals with NaNs.

In either case, simply specifying the behaviour would remove the need for wrapper methods that standardise it. For example, I created this (admittedly scrappy) utility method for HypothesisWorks/hypothesis#3065 to work for both scenarios, which @asmeurer noted was not ideal and probably a failure of the Array API spec.

def count_unique(array):
    n_unique = 0
    nan_index = xp.isnan(array)
    for isnan, count in zip(*xp.unique(nan_index, return_counts=True)):
        if isnan:
            n_unique += count
            break
    filtered_array = array[~nan_index]
    unique_array = xp.unique(filtered_array)
    n_unique += unique_array.size
    return n_unique

And if deterministic NaN behaviour cannot be part of the API, a note should be added to say this behaviour is out of scope.

Issue Analytics

State:
Created 2 years ago
Comments: 18 (13 by maintainers)

Top GitHub Comments

2 reactions

kgryte commented, Oct 21, 2021

Comparison among environments:

MATLAB: NaNs are distinct.

> A = [5 5 NaN NaN];
> C = unique(A)
C = 1×3

 5   NaN   NaN

Julia: returns only a single NaN

julia> unique([1,1,NaN,2,3,NaN,NaN,NaN])
4-element Array{Float64,1}:
   1.0
   NaN  
   2.0
   3.0

The unique implementation is based on Set:

julia> Set([1.0,2.0,NaN,2.0,NaN,3.0,NaN])
Set([NaN, 2.0, 3.0, 1.0])

And uses isequal for determining whether values are equal to one another.

julia> isequal([1., NaN], [1., NaN])
true

julia> [1., NaN] == [1., NaN]
false

As an aside, Julia’s unique and Set treat 0.0 and -0.0 as distinct.

julia> unique([0.0,-0.0])
2-element Array{Float64,1}:
  0.0
 -0.0

Python: with set, the result depends on whether the NaNs are the same object.

In [1]: set([float('nan'), float('nan')])
Out[1]: {nan, nan}

In [2]: nan = float('nan');

In [3]: set([nan,nan])
Out[3]: {nan}
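
The difference between the two outputs above comes from how Python containers compare elements: they check object identity before falling back to `==`. The following is a small sketch of that rule using only the standard library; the variable names are illustrative.

```python
# Two separately created NaN objects versus one NaN object reused.
a = float("nan")
b = float("nan")

# IEEE 754: NaN never compares equal, not even to itself.
nan_equals_itself = (a == a)

# Containers test identity (`is`) first, then equality (`==`).
# Distinct NaN objects fail both tests, so both are kept.
distinct_objects = {a, b}

# The same object passes the identity test, so it is stored once.
same_object = {a, a}

# Membership tests follow the same identity-first rule.
found_by_identity = a in [a]
found_other_nan = b in [a]
```

So a Python `set` deduplicates NaNs only when they are literally the same object, which is not a property an array library can rely on for values stored in a buffer.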

Torch: NaNs are distinct.

>>> t = torch.tensor([1.0, float("nan"), 1.0, float("nan")])
>>> torch.unique(t)
tensor([nan, nan, 1.])

TensorFlow: NaNs are distinct.

In [1]: tf.unique([1.0,2.0,np.nan,2.0,np.nan,3.0])
Out[1]: Unique(y=<tf.Tensor: shape=(5,), dtype=float32, numpy=array([ 1.,  2., nan, nan,  3.], dtype=float32)>, idx=<tf.Tensor: shape=(6,), dtype=int32, numpy=array([0, 1, 2, 1, 3, 4], dtype=int32)>)

In [2]: nan = float('nan');

In [3]: tf.unique([1.0,2.0,nan,2.0,nan,3.0])
Out[3]: Unique(y=<tf.Tensor: shape=(5,), dtype=float32, numpy=array([ 1.,  2., nan, nan,  3.], dtype=float32)>, idx=<tf.Tensor: shape=(6,), dtype=int32, numpy=array([0, 1, 2, 1, 3, 4], dtype=int32)>)

NumPy: previously treated NaNs as distinct and returned each one. As of v1.21, returns only a single NaN.

Downstream Libraries

In Matplotlib, no accommodation appears to be made for np.unique returning multiple NaNs.
In SciPy, there is one documented instance of using a workaround for multiple NaNs.
Scikit-learn has a small wrapper around np.unique to handle multiple NaNs. This wrapper assumes sorted unique values.

Proposal

Unique should return multiple NaNs (in line with previous NumPy behavior and with other libraries).
Specify the sort order for floating-point values (see gh-288).
To return only a single NaN, users can implement a workaround similar to sklearn's:
```
values = xp.unique_values(x)
sorted_values = xp.sort(values)
if bool(xp.isnan(sorted_values[-1])):
    # Find index of first NaN occurrence...
    idx = ...
    sorted_values = sorted_values[:idx+1]
```
There would be some additional work for getting the indices, inverse_indices, and counts.
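
One way the elided index lookup might be filled in is sketched below. It assumes NumPy stands in for `xp` and that sorting places NaNs last (as `np.sort` does); `collapse_trailing_nans` is a hypothetical helper name, not part of NumPy or the Array API spec. The input is built by hand because recent NumPy versions already collapse NaNs inside `np.unique`.

```python
import numpy as np


def collapse_trailing_nans(sorted_values):
    """Keep at most one NaN from a sorted array whose NaNs sit at the end.

    Hypothetical helper: assumes a 1-D floating-point array that is
    already sorted, with any NaNs placed last.
    """
    nan_mask = np.isnan(sorted_values)
    if not nan_mask.any():
        # No NaNs (this also covers empty input): nothing to collapse.
        return sorted_values
    # argmax on a boolean array returns the index of the first True.
    first_nan = int(np.argmax(nan_mask))
    return sorted_values[: first_nan + 1]


# Simulates the output of a unique() that treats NaNs as distinct.
distinct_nans = np.array([1.0, 2.0, np.nan, np.nan, np.nan])
collapsed = collapse_trailing_nans(distinct_nans)
```

Unlike the snippet above, this version checks the whole mask rather than only the last element, so it does not fail on an empty array. It still does not adjust indices, inverse indices, or counts.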

Another workaround for users is to select a sentinel value other than NaN to indicate a “missing” value for the purposes of unique. For example, if values are known to be integers on the interval [1,100], replace NaN values with 0 before calling unique.
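
The sentinel approach could look roughly like this. It is a sketch that assumes NumPy in place of `xp` and a caller who knows that the sentinel cannot occur in the real data; `unique_with_sentinel` is a hypothetical name used for illustration.

```python
import numpy as np


def unique_with_sentinel(x, sentinel):
    """Replace NaNs with `sentinel`, then take the unique values.

    Hypothetical helper: the caller must pick a sentinel that cannot
    appear among the genuine values of `x`.
    """
    filled = np.where(np.isnan(x), sentinel, x)
    return np.unique(filled)


# Values are known to lie in [1, 100], so 0 is safe as "missing".
x = np.array([5.0, np.nan, 7.0, np.nan, 5.0])
result = unique_with_sentinel(x, sentinel=0.0)
```

Because the sentinel is an ordinary number, the result is the same whether the underlying `unique` treats NaNs as distinct or as equal.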

In short, the issue of multiple NaN values seems, to me at least, to be a user concern and should thus be pushed to userland. While the utility of multiple NaN values in the output of unique is undoubtedly limited, I don’t see why the prevention of this should be the responsibility of array libraries, especially when bearing this responsibility entails a performance cost.

1 reaction

asmeurer commented, Oct 11, 2021

I think we have to make counts and values consistent with one another, regardless of which nan approach we take. It would be nonsensical to return something like values == array([nan, nan, nan, nan, nan]); counts = array([5]). Then values and counts wouldn’t even correspond to each other any more.
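
The consistency requirement can be phrased as two invariants that hold under either NaN policy. The sketch below assumes NumPy in place of `xp`; `check_unique_consistency` is a hypothetical helper, and it deliberately asserts nothing about how many NaNs are returned.

```python
import numpy as np


def check_unique_consistency(x):
    """Check that values and counts from unique() describe the same groups.

    Hypothetical helper: the invariants hold whether NaNs are treated
    as distinct or as equal.
    """
    values, counts = np.unique(x, return_counts=True)
    # One count per returned value.
    assert values.shape == counts.shape
    # Every input element is counted exactly once.
    assert int(counts.sum()) == x.size
    return values, counts


values, counts = check_unique_consistency(np.full(5, np.nan))
```

The nonsensical result described above (five values paired with a single count of 5) would fail the first assertion.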
