Extending the `nan_policy` guidelines to cases not currently covered
Over in https://github.com/scipy/scipy/pull/13572, there has been a discussion of how `scipy.stats.percentileofscore` should handle the `nan_policy` parameter. Here I want to have a general design discussion about that issue, since it is related to the use of `nan_policy` in several other functions.

There are cases where we want to add the `nan_policy` parameter (and even some cases where we have already done so) that are not covered by the current guidelines in "A Design Specification for `nan_policy`".
The functions that motivate this issue include:

- `np.percentile`, `np.nanpercentile`
- `percentileofscore` (see https://github.com/scipy/scipy/pull/13572)
- `scoreatpercentile`
- `zmap`
- `zscore`

(Names without the `np` prefix are from `scipy.stats`.)
The use of `nan_policy` makes sense for inputs that are “set-like”: the input is a collection of numbers (or, more generally, a set of points in ℝⁿ) with no special order, and the output shape does not depend on the number of points. For example, the input to a computation of the mean or the standard deviation of a collection of numbers can be considered set-like (and in fact, NumPy has `np.nanmean(x)` and `np.nanstd(x)`, which implement the behavior that corresponds to `nan_policy='omit'`). In SciPy, a simple example is `ttest_ind(a, b)`, where both `a` and `b` are set-like; `nan_policy='omit'` means ignore `nan` in both `a` and `b`. For `pearsonr(x, y)`, we can interpret the input as a set of points {(x_i, y_i)}; the output shape does not depend on how many points there are. In this case, `nan_policy='omit'` should remove any pair (x_i, y_i) where either x_i or y_i is `nan`.
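As a concrete illustration of the `pearsonr` case, here is a minimal sketch of pairwise omission; `omit_nan_pairs` is a hypothetical helper used only to show the intended behavior of `nan_policy='omit'` for paired, set-like input, not an existing SciPy function:

```python
import numpy as np

def omit_nan_pairs(x, y):
    # Drop any pair (x_i, y_i) in which either value is nan, which is the
    # behavior nan_policy='omit' is expected to have for paired set-like input.
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = ~(np.isnan(x) | np.isnan(y))
    return x[keep], y[keep]

x = [1.0, 2.0, np.nan, 4.0]
y = [2.0, np.nan, 6.0, 8.0]
print(omit_nan_pairs(x, y))  # (array([1., 4.]), array([2., 8.]))
```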
`nan_policy` is not useful for a simple element-wise function such as `np.sin(x)`. The shape of the output of `np.sin(x)` must be the same as that of the input, so there is no meaningful way to ignore a `nan`. This is also (typically) true for simple functions such as `scipy.special.xlogy(x, y)`, where the shape of the output is determined by broadcasting the inputs, and again there is no meaningful way to ignore a `nan`.
The functions `percentileofscore`, `scoreatpercentile`, `zmap` and `np.(nan)percentile` have structurally similar APIs. The core operation accepts two inputs, one a set-like collection of numbers and the other a scalar, and its return value is a scalar. For example, here’s `np.percentile`:
```python
In [27]: np.percentile([0.5, 1.5, 2.0, 3.1, 4, 5], 25)
Out[27]: 1.625

In [28]: np.percentile([0.5, 1.5, 2.0, 3.1, 4, 5], [25, 50])
Out[28]: array([1.625, 2.55 ])
```
And here is `zmap` (the order of the set-like input and the element-wise input is the opposite of that in `np.percentile`):
```python
In [29]: zmap(4, [1, 3, 19, 20])
Out[29]: array([-0.76829903])

In [30]: zmap([4, 10, 12], [1, 3, 19, 20])
Out[30]: array([-0.76829903, -0.08536656, 0.1422776 ])
```
(Note: ideally, the return value of `zmap(4, [1, 3, 19, 20])` would be a scalar or a 0-d array. I’m not sure we can fix that now; maybe we can consider that a bug?)
Likewise, `percentileofscore(a, score)` will take a set-like parameter `a` and an element-wise parameter `score` (in master, `score` must be a scalar):
```python
In [167]: percentileofscore([0.5, 1.5, 2.0, 3.1, 4, 5], 1.6)
Out[167]: 33.333333333333336

In [168]: percentileofscore([0.5, 1.5, 2.0, 3.1, 4, 5], [1.6, 2.7, 6.8])
Out[168]: array([ 33.33333333, 50. , 100. ])
```
The first parameter `a` is set-like, so there is no question about how the occurrence of `nan` in `a` should be handled. The question raised in https://github.com/scipy/scipy/pull/13572 is how `nan_policy` should work when `nan` occurs in the element-wise input `score`. Because the function operates element-wise on `score`, there is no meaningful way to ignore a `nan` in that argument.
There are (at least) two options:

Option 1. `nan_policy` applies only to the set-like inputs. Any `nan` that occurs in an element-wise input is always propagated, regardless of `nan_policy` (just like `np.sin(x)` and `xlogy(x, y)` propagate `nan`).

Option 2. When there is a `nan` in the element-wise argument and `nan_policy` is not `'propagate'`, raise an exception. That means we raise an exception even if `nan_policy='omit'`, because there is no meaningful way to omit a `nan` from an element-wise parameter.
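As a rough illustration (this is not SciPy code, and `check_elementwise_arg` is a made-up helper), the two options differ only in how a `nan` in the element-wise argument is handled:

```python
import numpy as np

def check_elementwise_arg(score, nan_policy='propagate', option=1):
    # Hypothetical helper: how a nan in an element-wise argument could be
    # treated under each of the two options described above.
    score = np.asarray(score, dtype=float)
    if np.isnan(score).any() and option == 2 and nan_policy != 'propagate':
        # Option 2: 'omit' and 'raise' both raise, because there is no
        # meaningful way to omit a nan from an element-wise parameter.
        raise ValueError("nan is not allowed in an element-wise argument")
    # Option 1: do nothing special; the nan simply propagates to the output,
    # regardless of nan_policy.
    return score
```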
We already have examples in NumPy and SciPy of these two alternatives. `np.nanpercentile` ignores `nan` in the first argument:
```python
In [12]: np.nanpercentile([0.5, 1.5, 2.0, 3.1, 4, np.nan, 5], [25, 50])
Out[12]: array([1.625, 2.55 ])
```
It will raise an exception if there is a `nan` in the second argument (which is an element-wise parameter), so it implements the behavior of option 2:
```python
In [13]: np.nanpercentile([0.5, 1.5, 2.0, 3.1, 4, np.nan, 5], [25, np.nan])
<snip>
ValueError: Percentiles must be in the range [0, 100]
```
In SciPy, we have already implemented option 1 for `zmap`. This is explained in its docstring:
```
nan_policy : {'propagate', 'raise', 'omit'}, optional
    Defines how to handle the occurrence of nans in `compare`.
    'propagate' returns nan, 'raise' raises an exception, 'omit'
    performs the calculations ignoring nan values. Default is
    'propagate'. Note that when the value is 'omit', nans in `scores`
    also propagate to the output, but they do not affect the z-scores
    computed for the non-nan values.
```
For example,
```python
In [17]: zmap([4, np.nan, 12], [1, 3, 19, 20], nan_policy='omit')
Out[17]: array([-0.76829903, nan, 0.1422776 ])
```
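The same result can be reproduced with plain NumPy, which makes the documented behavior explicit: the location and scale come from `compare` (with any `nan` ignored), and the transformation is then applied element-wise to `scores`, so a `nan` in `scores` stays `nan` in the output:

```python
import numpy as np

scores = np.array([4.0, np.nan, 12.0])
compare = np.array([1.0, 3.0, 19.0, 20.0])

# Parameters from `compare`, ignoring nan; transformation applied element-wise.
z = (scores - np.nanmean(compare)) / np.nanstd(compare)
print(z)  # [-0.76829903         nan  0.1422776 ]
```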
The precedent set by `zmap` is an argument for doing the same in `percentileofscore`.
There is a further complication. Not all inputs can be classified as only set-like or only element-wise. Consider `zscore(a)`. The function is simply `zmap(a, a)`, so its one parameter `a` is both set-like (in how the input values affect the values of the output) and element-wise (it returns a transformed array that is the same size as `a`). `nan_policy` has already been implemented for `zmap`, and it works by applying the policy to the set-like operation; because the implementation of `zscore` is just a one-line call of `zmap`, `zscore` inherits that behavior. Here’s an example with no `nan` values:
```python
In [28]: zscore([1, 3, 19, 20])  # No nan values
Out[28]: array([-1.10976527, -0.88212111, 0.93903215, 1.05285423])
```
`nan_policy='propagate'` with `nan` in the input does what we would expect:
```python
In [29]: zscore([1, 3, 19, 20, np.nan], nan_policy='propagate')
Out[29]: array([nan, nan, nan, nan, nan])
```
With `nan_policy='omit'`, the parameters of the transformation are computed from `a` with the `nan` ignored, and then the transformation is applied to each input value, including the `nan`, so we get a corresponding `nan` in the output:
```python
In [30]: zscore([1, 3, 19, 20, np.nan], nan_policy='omit')
Out[30]: array([-1.10976527, -0.88212111, 0.93903215, 1.05285423, nan])
```
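A plain NumPy restatement of this example makes the two steps explicit (this is just an illustration, not the actual `zscore` implementation):

```python
import numpy as np

a = np.array([1.0, 3.0, 19.0, 20.0, np.nan])

# Step 1: compute the parameters of the transformation ignoring nan.
loc, scale = np.nanmean(a), np.nanstd(a)

# Step 2: apply the transformation element-wise; the nan stays nan.
print((a - loc) / scale)
# [-1.10976527 -0.88212111  0.93903215  1.05285423         nan]
```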
Other functions like this, where the parameters of the transformation depend on all the values in the input (so we can treat the input as set-like when computing those parameters) and then the transformation is applied to each element (so the transformation is applied element-wise), include `boxcox(x)` (i.e., when the second argument `lmbda` is None) and `scipy.special.softmax`. I don’t think there is a strong demand for adding `nan_policy` to these functions, but if we did, the behavior could follow that of `zscore`.
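For instance, here is a purely hypothetical sketch of what an 'omit'-style behavior could look like for `softmax` if it followed the `zscore` precedent; `softmax` does not currently accept a `nan_policy` argument, and `softmax_omit` is not a SciPy function:

```python
import numpy as np
from scipy.special import softmax

def softmax_omit(x):
    # Hypothetical: normalize over the non-nan values only, and leave nan
    # entries as nan in the output, mirroring zscore's nan_policy='omit'.
    x = np.asarray(x, dtype=float)
    out = np.full(x.shape, np.nan)
    mask = ~np.isnan(x)
    out[mask] = softmax(x[mask])
    return out

print(softmax_omit([1.0, 2.0, np.nan]))  # [0.26894142 0.73105858        nan]
```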
The design decision to be made is the choice of option 1 or 2 explained above. I lean towards option 1 (`nan_policy` applies only to set-like operations). In the examples considered so far, it isn’t that hard to understand the concept, and we also have the precedent of the behaviors implemented in `zmap` and `zscore`. However, we may encounter more complicated cases where the notion of a “set-like” parameter or operation becomes less well-defined or harder to explain.
@Kai-Striega @tupui @tirthasheshpatel thought you might have an opinion here that could help make the decision?
+1 for option 1 from me. I find dealing with nan policies as a developer can be confusing and this seems like the more intuitive option. Has anyone looked to see if this is already the way most nan-policies are implemented?