Specify NaN behaviour in `unique()`

Say we have an array with multiple NaN values:

>>> x = xp.full(5, xp.nan)
>>> x
Array([nan, nan, nan, nan, nan])

Should NaNs be treated as unique to one another?

>>> xp.unique(x)
Array([nan, nan, nan, nan, nan])

Or should they be treated as the same?

>>> xp.unique(x)
Array([nan])
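
The two outcomes follow from two different notions of equality. A minimal pure-Python sketch shows how each one arises. It is not tied to any array library, and the helper names `unique_by_eq` and `unique_nan_equal` are hypothetical:

```python
import math

def unique_by_eq(values):
    """Deduplicate with == only; since nan != nan, every NaN is kept."""
    result = []
    for v in values:
        if not any(v == seen for seen in result):
            result.append(v)
    return result

def unique_nan_equal(values):
    """Deduplicate while treating all NaNs as equal to one another."""
    result = []
    for v in values:
        if not any(
            v == seen or (math.isnan(v) and math.isnan(seen)) for seen in result
        ):
            result.append(v)
    return result

x = [float("nan")] * 5
print(len(unique_by_eq(x)))      # 5
print(len(unique_nan_equal(x)))  # 1
```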

I have no dog in the race, but you might want to refer to the discussion at the bottom of numpy/numpy#19301 and on the NumPy mailing list, which relates to a recent accidental “regression” in how np.unique() deals with NaNs.

In either case, simply specifying the behaviour would remove the need for wrapper methods that standardise it. For example, I created this (admittedly scrappy) utility method for HypothesisWorks/hypothesis#3065 to work for both scenarios, which @asmeurer noted was not ideal and probably a failure of the Array API spec.

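def count_unique(array):
    # xp is the array namespace under test.
    n_unique = 0
    nan_index = xp.isnan(array)
    # Count the NaNs via the boolean mask; each NaN is counted separately.
    for isnan, count in zip(*xp.unique(nan_index, return_counts=True)):
        if isnan:
            n_unique += count
            break
    # Count the unique values among the remaining non-NaN elements.
    filtered_array = array[~nan_index]
    unique_array = xp.unique(filtered_array)
    n_unique += unique_array.size
    return n_unique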
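
For reference, here is a hedged pure-Python equivalent of what the utility computes on a flat sequence of floats: every NaN counts as its own value, and the remaining values are deduplicated normally. The name `count_unique_reference` is hypothetical and is not part of Hypothesis or the Array API.

```python
import math

def count_unique_reference(values):
    """Count unique values in a flat sequence, treating each NaN as distinct."""
    n_nans = sum(1 for v in values if math.isnan(v))
    n_non_nan = len({v for v in values if not math.isnan(v)})
    return n_nans + n_non_nan
```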

And if deterministic NaN behaviour cannot be part of the API, a note should be added to say this behaviour is out of scope.

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 18 (13 by maintainers)

Top GitHub Comments

2 reactions
kgryte commented, Oct 21, 2021

Comparison among environments:

  • MATLAB: NaNs are distinct.

    > A = [5 5 NaN NaN];
    > C = unique(A)
    C = 1×3
    
     5   NaN   NaN
    
  • Julia: returns only a single NaN

    julia> unique([1,1,NaN,2,3,NaN,NaN,NaN])
    4-element Array{Float64,1}:
       1.0
       NaN  
       2.0
       3.0
    

    The unique implementation is based on Set:

    julia> Set([1.0,2.0,NaN,2.0,NaN,3.0,NaN])
    Set([NaN, 2.0, 3.0, 1.0])
    

    It uses isequal to determine whether values are equal to one another.

    julia> isequal([1., NaN], [1., NaN])
    true
    
    julia> [1., NaN] == [1., NaN]
    false
    

    As an aside, Julia’s unique and Set treat 0.0 and -0.0 as distinct.

    julia> unique([0.0,-0.0])
    2-element Array{Float64,1}:
      0.0
     -0.0
    
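  • Python: with set, the result depends on object identity. Two separate NaN objects are both kept, while the same NaN object repeated is deduplicated.

    In [1]: set([float('nan'), float('nan')])
    Out[1]: {nan, nan}
    
    In [2]: nan = float('nan');
    
    In [3]: set([nan,nan])
    Out[3]: {nan}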
    
  • Torch: NaNs are distinct.

    >>> t = torch.tensor([1.0, float("nan"), 1.0, float("nan")])
    >>> torch.unique(t)
    tensor([nan, nan, 1.])
    
  • TensorFlow: NaNs are distinct.

    In [1]: tf.unique([1.0,2.0,np.nan,2.0,np.nan,3.0])
    Out[1]: Unique(y=<tf.Tensor: shape=(5,), dtype=float32, numpy=array([ 1.,  2., nan, nan,  3.], dtype=float32)>, idx=<tf.Tensor: shape=(6,), dtype=int32, numpy=array([0, 1, 2, 1, 3, 4], dtype=int32)>)
    
    In [2]: nan = float('nan');
    
    In [3]: tf.unique([1.0,2.0,nan,2.0,nan,3.0])
    Out[3]: Unique(y=<tf.Tensor: shape=(5,), dtype=float32, numpy=array([ 1.,  2., nan, nan,  3.], dtype=float32)>, idx=<tf.Tensor: shape=(6,), dtype=int32, numpy=array([0, 1, 2, 1, 3, 4], dtype=int32)>)
    
  • NumPy: previously treated NaNs as distinct and returned each of them. As of v1.21, returns only a single NaN.
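
The Python set result in the comparison comes from how built-in containers test membership: they check identity (`is`) before equality (`==`). A short sketch using only the standard library:

```python
a = float("nan")
b = float("nan")

# Equality is always False for NaN, even when compared against itself.
print(a == a)  # False

# Distinct NaN objects are neither identical nor equal, so a set keeps both.
print(len({a, b}))  # 2

# The same object repeated is deduplicated by the identity check.
print(len({a, a}))  # 1
```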

Downstream Libraries

  • In Matplotlib, no accommodation appears to be made for np.unique returning multiple NaNs.
  • In SciPy, there is one documented instance of using a workaround for multiple NaNs.
  • Scikit-learn has a small wrapper around np.unique to handle multiple NaNs. This wrapper assumes sorted unique values.

Proposal

  • Unique should return multiple NaNs (in line with previous NumPy behavior and other libraries).

  • Specify sort order for floating-point values (see gh-288)

  • To return only a single NaN, users can implement a similar workaround to sklearn:

    values = xp.unique_values(x)
    sorted_values = xp.sort(values)
    if bool(xp.isnan(sorted_values[-1])):
        # Find index of first NaN occurrence...
        idx = ...
        sorted_values = sorted_values[:idx+1]
    

    There would be some additional work for getting the indices, inverse_indices, and counts.

    Another workaround for users is to select a sentinel value other than NaN to indicate a “missing” value for the purposes of unique. For example, if values are known to be integers on the interval [1,100], replace NaN values with 0 before calling unique.
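
The truncation workaround can be sketched in pure Python. This is an illustration only, on a plain list rather than an array: `collapse_nans` is a hypothetical helper, and it assumes its input is the output of a unique-style function that may contain several NaNs.

```python
import math

def collapse_nans(unique_values):
    """Sort the values with NaNs last, then keep only the first NaN."""
    # The sort key places NaNs after all other values, which the
    # truncation step below relies on.
    ordered = sorted(
        unique_values,
        key=lambda v: (math.isnan(v), 0.0 if math.isnan(v) else v),
    )
    if ordered and math.isnan(ordered[-1]):
        # Find the index of the first NaN occurrence.
        idx = next(i for i, v in enumerate(ordered) if math.isnan(v))
        ordered = ordered[: idx + 1]
    return ordered

nan = float("nan")
result = collapse_nans([3.0, nan, 1.0, nan, 2.0])
print(result)  # [1.0, 2.0, 3.0, nan]
```

Recovering indices, inverse indices, and counts would need extra bookkeeping that this sketch does not attempt.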

In short, the issue of multiple NaN values seems, to me at least, to be a user concern and should thus be pushed to userland. While the utility of multiple NaN values in the output of unique is undoubtedly limited, I don’t see why the prevention of this should be the responsibility of array libraries, especially when bearing this responsibility entails a performance cost.
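
As one example of handling this in userland, the sentinel workaround mentioned in the proposal might look like the following pure-Python sketch. The helper name `unique_with_sentinel` and the choice of `0` as the sentinel are assumptions for illustration; the sentinel must be a value that cannot occur in the data.

```python
import math

def unique_with_sentinel(values, sentinel=0):
    """Replace NaNs with a sentinel, then deduplicate and sort."""
    if any(v == sentinel for v in values):
        raise ValueError("sentinel already occurs in the data")
    replaced = [sentinel if math.isnan(v) else v for v in values]
    return sorted(set(replaced))

nan = float("nan")
print(unique_with_sentinel([5.0, nan, 100.0, nan, 5.0]))  # [0, 5.0, 100.0]
```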

1 reaction
asmeurer commented, Oct 11, 2021

I think we have to make counts and values consistent with one another, regardless of which nan approach we take. It would be nonsensical to return something like values == array([nan, nan, nan, nan, nan]); counts = array([5]). Then values and counts wouldn’t even correspond to each other any more.
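
To make the consistency requirement concrete, here is a minimal pure-Python sketch in which values and counts always have the same length under either NaN convention. The helper `unique_counts` and its `collapse_nans` flag are hypothetical and do not reflect any library's API.

```python
import math

def unique_counts(data, collapse_nans):
    """Return (values, counts) with exactly one count per returned value."""
    values, counts = [], []
    for v in data:
        for i, seen in enumerate(values):
            same_nan = collapse_nans and math.isnan(v) and math.isnan(seen)
            if v == seen or same_nan:
                counts[i] += 1
                break
        else:
            # No match found: v is a new unique value.
            values.append(v)
            counts.append(1)
    return values, counts

x = [float("nan")] * 5
print(unique_counts(x, collapse_nans=True)[1])   # [5]
print(unique_counts(x, collapse_nans=False)[1])  # [1, 1, 1, 1, 1]
```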
