Specify NaN behaviour in `unique()`
Say we have an array with multiple NaN values:
>>> x = xp.full(5, xp.nan)
>>> x
Array([nan, nan, nan, nan, nan])
Should NaNs be treated as unique to one another?
>>> xp.unique(x)
Array([nan, nan, nan, nan, nan])
Or should they be treated as the same?
>>> xp.unique(x)
Array([nan])
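The ambiguity comes from IEEE 754 semantics: NaN compares unequal to itself, so deduplication based purely on `==` never merges two NaNs. As a minimal sketch in plain Python (lists standing in for arrays; `unique_distinct_nans` and `unique_single_nan` are hypothetical helper names, not part of any array library), the two candidate behaviours look roughly like this:

```python
import math

def unique_distinct_nans(values):
    """Deduplicate using == only, so every NaN survives (NaN != NaN)."""
    result = []
    for v in values:
        # Explicit == comparisons; these are always False when v is NaN.
        if not any(v == seen for seen in result):
            result.append(v)
    return result

def unique_single_nan(values):
    """Deduplicate, treating all NaNs as one value."""
    result = []
    seen_nan = False
    for v in values:
        if math.isnan(v):
            if not seen_nan:
                result.append(v)
                seen_nan = True
        elif v not in result:
            result.append(v)
    return result

x = [math.nan] * 5
print(len(unique_distinct_nans(x)))  # 5
print(len(unique_single_nan(x)))     # 1
```

Both helpers preserve first-seen order rather than sorting, which keeps the sketch independent of how NaNs are ordered.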
I have no dog in the race, but you might want to refer to the discussion at the bottom of numpy/numpy#19301 and on the NumPy mailing list, which relates to a recent accidental “regression” in how np.unique() deals with NaNs.
In either case, specifying the behaviour would remove the need for wrapper methods that standardise it. For example, I created this (admittedly scrappy) utility method for HypothesisWorks/hypothesis#3065 to work in both scenarios, which @asmeurer noted was not ideal and probably a failure of the Array API spec:
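def count_unique(array):
    # Counts unique elements with every NaN counted separately, whether or
    # not xp.unique() itself collapses NaNs.
    n_unique = 0
    nan_index = xp.isnan(array)
    # nan_index is boolean, so unique() yields at most a False and a True group.
    for isnan, count in zip(*xp.unique(nan_index, return_counts=True)):
        if isnan:
            n_unique += count  # each NaN counts as its own unique value
            break
    # Count the unique non-NaN values separately.
    filtered_array = array[~nan_index]
    unique_array = xp.unique(filtered_array)
    n_unique += unique_array.size
    return n_unique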
And if deterministic NaN behaviour cannot be part of the API, a note should be added to say this behaviour is out of scope.
Top GitHub Comments
Comparison among environments:
- MATLAB: `NaN`s are distinct.
- Julia: returns only a single `NaN`. The `unique` implementation is based on `Set` and uses `isequal` for determining whether values are equal to one another. As an aside, Julia’s `unique` and `Set` treat `0.0` and `-0.0` as distinct.
- Python: depends on use of `set`.
- Torch: `NaN`s are distinct.
- TensorFlow: `NaN`s are distinct.
- NumPy previously returned unique `NaN`s. As of `v1.21`, returns only a single `NaN`.
Downstream Libraries
- `np.unique` returning multiple NaNs.
- `np.unique` to handle multiple NaNs. This wrapper assumes sorted unique values.
Proposal
- `unique` should return multiple `NaN`s (in line with previous NumPy behavior and other libraries).
- Specify sort order for floating-point values (see gh-288).
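To illustrate why the sort order needs specifying once NaNs can appear in the output: every ordering comparison against NaN is false, so an ordinary comparison sort gives NaNs no guaranteed position. A minimal sketch in plain Python, assuming the convention that NaNs sort last (this is what `np.sort` does; the discussion in gh-288 may settle on something else) and using a hypothetical helper name `sort_nan_last`:

```python
import math

def sort_nan_last(values):
    """Sort floats ascending with all NaNs placed at the end."""
    # The key sorts on (is_nan, value): False < True, so every non-NaN
    # comes before every NaN; non-NaN values are then ordered by value.
    return sorted(values, key=lambda v: (math.isnan(v), v))

data = [3.0, math.nan, 1.0, math.nan, 2.0]
result = sort_nan_last(data)
print(result[:3])                              # [1.0, 2.0, 3.0]
print(all(math.isnan(v) for v in result[3:]))  # True
```

With a fixed position for NaNs, a wrapper that assumes sorted unique values can find them by looking only at the tail of the result.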
To return only a single `NaN`, users can implement a workaround similar to sklearn’s. There would be some additional work for getting the `indices`, `inverse_indices`, and `counts`.
Another workaround for users is to select a sentinel value other than `NaN` to indicate a “missing” value for the purposes of `unique`. For example, if values are known to be integers on the interval `[1,100]`, replace `NaN` values with `0` before calling `unique`.
In short, the issue of multiple `NaN` values seems, to me at least, to be a user concern and should thus be pushed to userland. While the utility of multiple `NaN` values in the output of `unique` is undoubtedly limited, I don’t see why the prevention of this should be the responsibility of array libraries, especially when bearing this responsibility entails a performance cost.

I think we have to make counts and values consistent with one another, regardless of which NaN approach we take. It would be nonsensical to return something like `values == array([nan, nan, nan, nan, nan]); counts = array([5])`. Then `values` and `counts` wouldn’t even correspond to each other any more.
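The consistency requirement can be sketched in plain Python, with lists standing in for arrays. The helper `unique_with_counts` and its `collapse_nans` flag are hypothetical names used only for illustration, not a proposed spec signature. The point is that under either NaN policy, `values` and `counts` have the same length and the counts sum to the input size:

```python
import math

def unique_with_counts(values, collapse_nans):
    """Return (unique_values, counts) in first-seen order."""
    uniques, counts = [], []
    nan_slot = None  # index of the shared NaN entry when collapsing
    for v in values:
        if math.isnan(v):
            if collapse_nans and nan_slot is not None:
                # Collapsing and a NaN entry already exists: bump its count.
                counts[nan_slot] += 1
                continue
            if collapse_nans:
                nan_slot = len(uniques)
            # First NaN when collapsing, or every NaN when not collapsing:
            # create a new entry with count 1.
            uniques.append(v)
            counts.append(1)
            continue
        # Non-NaN: look up by ==; NaN entries never match.
        for i, u in enumerate(uniques):
            if u == v:
                counts[i] += 1
                break
        else:
            uniques.append(v)
            counts.append(1)
    return uniques, counts

x = [math.nan] * 5
vals, cnts = unique_with_counts(x, collapse_nans=False)
print(len(vals), cnts)  # 5 [1, 1, 1, 1, 1]
vals, cnts = unique_with_counts(x, collapse_nans=True)
print(len(vals), cnts)  # 1 [5]
```

Indices and inverse indices would need the same kind of bookkeeping to stay aligned with `values`.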