Extending the `nan_policy` guidelines to cases not currently covered
Over in https://github.com/scipy/scipy/pull/13572, there has been a discussion of how `scipy.stats.percentileofscore` should handle the `nan_policy` parameter. Here I want to have a general design discussion about that issue, since it is related to the use of `nan_policy` in several other functions.

There are cases where we want to add the `nan_policy` parameter (and even some cases where we have already done so) that are not covered by the current guidelines in "A Design Specification for `nan_policy`".
The functions that motivate this issue include:

- `np.percentile`, `np.nanpercentile`
- `percentileofscore` (see https://github.com/scipy/scipy/pull/13572)
- `scoreatpercentile`
- `zmap`
- `zscore`

(Names without the `np` prefix are from `scipy.stats`.)
The use of `nan_policy` makes sense for inputs that are “set-like”: the input is a collection of numbers (or, more generally, a set of points in ℝⁿ) with no special order, and the output shape does not depend on the number of points. For example, the input to a computation of the mean or the standard deviation of a collection of numbers can be considered set-like (and in fact, NumPy has `np.nanmean(x)` and `np.nanstd(x)`, which implement the behavior that corresponds to `nan_policy='omit'`). In SciPy, a simple example is `ttest_ind(a, b)`, where both `a` and `b` are set-like; `nan_policy='omit'` means ignore `nan` in both `a` and `b`. For `pearsonr(x, y)`, we can interpret the input as a set of points {(x_i, y_i)}; the output shape does not depend on how many points there are. In this case, `nan_policy='omit'` should remove any pair (x_i, y_i) where either x_i or y_i is `nan`.
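As a concrete illustration of the `pearsonr` case, here is a minimal sketch of pairwise omission; `omit_nan_pairs` is a hypothetical helper used only to show the intended behavior of `nan_policy='omit'` for paired, set-like input, not an existing SciPy function:

```python
import numpy as np

def omit_nan_pairs(x, y):
    # Drop any pair (x_i, y_i) in which either value is nan, which is the
    # behavior nan_policy='omit' is expected to have for paired set-like input.
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = ~(np.isnan(x) | np.isnan(y))
    return x[keep], y[keep]

x = [1.0, 2.0, np.nan, 4.0]
y = [2.0, np.nan, 6.0, 8.0]
print(omit_nan_pairs(x, y))  # (array([1., 4.]), array([2., 8.]))
```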
`nan_policy` is not useful for a simple element-wise function such as `np.sin(x)`. The shape of the output of `np.sin(x)` must be the same as that of the input, so there is no meaningful way to ignore a `nan`. This is also (typically) true for simple functions such as `scipy.special.xlogy(x, y)`, where the shape of the output is determined by broadcasting the inputs, and again there is no meaningful way to ignore a `nan`.
The functions `percentileofscore`, `scoreatpercentile`, `zmap` and `np.(nan)percentile` have structurally similar APIs. The core operation accepts two inputs, one a set-like collection of numbers and the other a scalar, and its return value is a scalar. For example, here’s `np.percentile`:
```python
In [27]: np.percentile([0.5, 1.5, 2.0, 3.1, 4, 5], 25)
Out[27]: 1.625

In [28]: np.percentile([0.5, 1.5, 2.0, 3.1, 4, 5], [25, 50])
Out[28]: array([1.625, 2.55 ])
```
And here is `zmap` (the order of the set-like input and the element-wise input is the opposite of that in `np.percentile`):
```python
In [29]: zmap(4, [1, 3, 19, 20])
Out[29]: array([-0.76829903])

In [30]: zmap([4, 10, 12], [1, 3, 19, 20])
Out[30]: array([-0.76829903, -0.08536656, 0.1422776 ])
```
(Note: ideally, the return value of `zmap(4, [1, 3, 19, 20])` would be a scalar or a 0-d array. I’m not sure we can fix that now; maybe we can consider that a bug?)
Likewise, `percentileofscore(a, score)` will take a set-like parameter `a` and an element-wise parameter `score` (in master, `score` must be a scalar):
```python
In [167]: percentileofscore([0.5, 1.5, 2.0, 3.1, 4, 5], 1.6)
Out[167]: 33.333333333333336

In [168]: percentileofscore([0.5, 1.5, 2.0, 3.1, 4, 5], [1.6, 2.7, 6.8])
Out[168]: array([ 33.33333333, 50. , 100. ])
```
The first parameter `a` is set-like, so there is no question about how the occurrence of `nan` in `a` should be handled. The question raised in https://github.com/scipy/scipy/pull/13572 is how `nan_policy` should work when `nan` occurs in the element-wise input `score`. Because the function operates element-wise on `score`, there is no meaningful way to ignore a `nan` in that argument.
There are (at least) two options:

Option 1. `nan_policy` applies only to the set-like inputs. Any `nan` that occurs in an element-wise input is always propagated, regardless of `nan_policy` (just like `np.sin(x)` and `xlogy(x, y)` propagate `nan`).

Option 2. When there is a `nan` in the element-wise argument and `nan_policy` is not `'propagate'`, raise an exception. That means we raise an exception even if `nan_policy='omit'`, because there is no meaningful way to omit a `nan` from an element-wise parameter.
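As a rough illustration (this is not SciPy code, and `check_elementwise_arg` is a made-up helper), the two options differ only in how a `nan` in the element-wise argument is handled:

```python
import numpy as np

def check_elementwise_arg(score, nan_policy='propagate', option=1):
    # Hypothetical helper: how a nan in an element-wise argument could be
    # treated under each of the two options described above.
    score = np.asarray(score, dtype=float)
    if np.isnan(score).any() and option == 2 and nan_policy != 'propagate':
        # Option 2: 'omit' and 'raise' both raise, because there is no
        # meaningful way to omit a nan from an element-wise parameter.
        raise ValueError("nan is not allowed in an element-wise argument")
    # Option 1: do nothing special; the nan simply propagates to the output,
    # regardless of nan_policy.
    return score
```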
We already have examples in NumPy and SciPy of these two alternatives. `np.nanpercentile` ignores `nan` in the first argument:
```python
In [12]: np.nanpercentile([0.5, 1.5, 2.0, 3.1, 4, np.nan, 5], [25, 50])
Out[12]: array([1.625, 2.55 ])
```
It will raise an exception if there is a `nan` in the second argument (which is an element-wise parameter), so it implements the behavior of option 2:
```python
In [13]: np.nanpercentile([0.5, 1.5, 2.0, 3.1, 4, np.nan, 5], [25, np.nan])
<snip>
ValueError: Percentiles must be in the range [0, 100]
```
In SciPy, we have already implemented option 1 for `zmap`. This is explained in its docstring:
```
nan_policy : {'propagate', 'raise', 'omit'}, optional
    Defines how to handle the occurrence of nans in `compare`.
    'propagate' returns nan, 'raise' raises an exception, 'omit'
    performs the calculations ignoring nan values. Default is
    'propagate'. Note that when the value is 'omit', nans in `scores`
    also propagate to the output, but they do not affect the z-scores
    computed for the non-nan values.
```
For example,
```python
In [17]: zmap([4, np.nan, 12], [1, 3, 19, 20], nan_policy='omit')
Out[17]: array([-0.76829903, nan, 0.1422776 ])
```
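The same result can be reproduced with plain NumPy, which makes the documented behavior explicit: the location and scale come from `compare` (with any `nan` ignored), and the transformation is then applied element-wise to `scores`, so a `nan` in `scores` stays `nan` in the output:

```python
import numpy as np

scores = np.array([4.0, np.nan, 12.0])
compare = np.array([1.0, 3.0, 19.0, 20.0])

# Parameters from `compare`, ignoring nan; transformation applied element-wise.
z = (scores - np.nanmean(compare)) / np.nanstd(compare)
print(z)  # [-0.76829903         nan  0.1422776 ]
```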
The precedent set by `zmap` is an argument for doing the same in `percentileofscore`.
There is a further complication. Not all inputs can be classified as only set-like or only element-wise. Consider `zscore(a)`. The function is simply `zmap(a, a)`, so its one parameter `a` is both set-like (in how the input values affect the values of the output) and element-wise (it returns a transformed array that is the same size as `a`). `nan_policy` has already been implemented for `zmap`, and it works by applying the policy to the set-like operation; because the implementation of `zscore` is just a one-line call of `zmap`, `zscore` inherits that behavior. Here’s an example with no `nan` values:
```python
In [28]: zscore([1, 3, 19, 20])  # No nan values
Out[28]: array([-1.10976527, -0.88212111, 0.93903215, 1.05285423])
```
`nan_policy='propagate'` with `nan` in the input does what we would expect:
```python
In [29]: zscore([1, 3, 19, 20, np.nan], nan_policy='propagate')
Out[29]: array([nan, nan, nan, nan, nan])
```
With `nan_policy='omit'`, the parameters of the transformation are computed from `a` with the `nan` ignored, and then the transformation is applied to each input value, including the `nan`, so we get a corresponding `nan` in the output:
```python
In [30]: zscore([1, 3, 19, 20, np.nan], nan_policy='omit')
Out[30]: array([-1.10976527, -0.88212111, 0.93903215, 1.05285423, nan])
```
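A plain NumPy restatement of this example makes the two steps explicit (this is just an illustration, not the actual `zscore` implementation):

```python
import numpy as np

a = np.array([1.0, 3.0, 19.0, 20.0, np.nan])

# Step 1: compute the parameters of the transformation ignoring nan.
loc, scale = np.nanmean(a), np.nanstd(a)

# Step 2: apply the transformation element-wise; the nan stays nan.
print((a - loc) / scale)
# [-1.10976527 -0.88212111  0.93903215  1.05285423         nan]
```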
Other functions like this, where the parameters of the transformation depend on all the values in the input (so we can treat the input as set-like when computing those parameters) and then the transformation is applied to each element (so the transformation is applied element-wise), include `boxcox(x)` (i.e., when the second argument `lmbda` is None) and `scipy.special.softmax`. I don’t think there is a strong demand for adding `nan_policy` to these functions, but if we did, the behavior could follow that of `zscore`.
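For instance, here is a purely hypothetical sketch of what an 'omit'-style behavior could look like for `softmax` if it followed the `zscore` precedent; `softmax` does not currently accept a `nan_policy` argument, and `softmax_omit` is not a SciPy function:

```python
import numpy as np
from scipy.special import softmax

def softmax_omit(x):
    # Hypothetical: normalize over the non-nan values only, and leave nan
    # entries as nan in the output, mirroring zscore's nan_policy='omit'.
    x = np.asarray(x, dtype=float)
    out = np.full(x.shape, np.nan)
    mask = ~np.isnan(x)
    out[mask] = softmax(x[mask])
    return out

print(softmax_omit([1.0, 2.0, np.nan]))  # [0.26894142 0.73105858        nan]
```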
The design decision to be made is the choice of option 1 or 2 explained above. I lean towards option 1 (`nan_policy` applies only to set-like operations). In the examples considered so far, it isn’t that hard to understand the concept, and we also have the precedent of the behaviors implemented in `zmap` and `zscore`. However, we may encounter more complicated cases where the notion of a “set-like” parameter or operation becomes less well-defined or harder to explain.
@Kai-Striega @tupui @tirthasheshpatel thought you might have an opinion here that could help make the decision?
+1 for option 1 from me. I find dealing with nan policies as a developer can be confusing and this seems like the more intuitive option. Has anyone looked to see if this is already the way most nan-policies are implemented?