_weighted_percentile does not lead to the same result as np.median

While reviewing a test in https://github.com/scikit-learn/scikit-learn/pull/16937, it appears that our implementation of _weighted_percentile with unit sample_weight leads to a different result than np.median, which is a bit problematic for consistency.

In gradient boosting, it breaks the loss equivalence because the initial predictions are different. We could bypass this issue by always computing the median using _weighted_percentile there.

import pytest
import numpy as np
from sklearn.utils.stats import _weighted_percentile

rng = np.random.RandomState(42)
X = rng.randn(10)
X.sort()
sample_weight = np.ones(X.shape)  # unit weights

median_numpy = np.median(X)
median_numpy_percentile = np.percentile(X, 50)
median_sklearn = _weighted_percentile(X, sample_weight, percentile=50.0)

# NumPy averages the two middle values of the even-length sample
assert median_numpy == pytest.approx(np.mean(X[[4, 5]]))
# _weighted_percentile returns the lower of the two middle values
assert median_sklearn == pytest.approx(X[4])
assert median_numpy == median_numpy_percentile
# This last assertion fails: the two medians disagree
assert median_sklearn == pytest.approx(median_numpy)
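
The gap can be reproduced without scikit-learn. The sketch below is not the library's implementation; `lower_weighted_percentile` is a hypothetical helper that assumes a cumulative-weight rule (return the first sorted value whose cumulative weight reaches the target), which is consistent with the X[4] result in the assertions above.

```python
import numpy as np


def lower_weighted_percentile(values, weights, percentile=50.0):
    """Hypothetical helper (an assumption, not sklearn code): return the
    first sorted value whose cumulative weight reaches the target."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    cum_weights = np.cumsum(weights[order])
    target = percentile / 100.0 * cum_weights[-1]
    # First index whose cumulative weight is >= target
    idx = np.searchsorted(cum_weights, target)
    idx = min(idx, len(values) - 1)
    return values[order][idx]


rng = np.random.RandomState(42)
X = np.sort(rng.randn(10))
unit_weights = np.ones_like(X)

lower_median = lower_weighted_percentile(X, unit_weights)
numpy_median = np.median(X)

# With 10 unit weights the target is 5.0, reached at sorted index 4,
# whereas np.median averages the values at indices 4 and 5.
print(lower_median, numpy_median)
```

For an even number of samples the cumulative rule never averages the two middle values, so it disagrees with np.median whenever those two values differ.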

Top GitHub Comments

glemaitre commented, May 28, 2020

I think that our _weighted_percentile should offer the same default as NumPy and maybe an interpolation parameter if required.
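
One way such an interpolation parameter could look is sketched below. Everything here is an assumption for illustration: the function name `weighted_percentile_sketch`, the `interpolation` argument, and the choice of placing each sorted sample at the midpoint of its cumulative-weight interval are not taken from scikit-learn. With unit weights the "linear" mode agrees with np.median at the 50th percentile; it is not claimed to match np.percentile at other percentiles. Weights are assumed to be strictly positive.

```python
import numpy as np


def weighted_percentile_sketch(values, weights, percentile=50.0,
                               interpolation="linear"):
    """Hypothetical sketch of a weighted percentile with two modes.

    "lower":  first sorted value whose cumulative weight reaches the target.
    "linear": interpolate between samples, each placed at the midpoint of
              its cumulative-weight interval (assumes positive weights).
    """
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    cum_weights = np.cumsum(weights)
    q = percentile / 100.0

    if interpolation == "lower":
        idx = np.searchsorted(cum_weights, q * cum_weights[-1])
        return values[min(idx, len(values) - 1)]
    if interpolation == "linear":
        # Normalized position of each sample in [0, 1]
        positions = (cum_weights - 0.5 * weights) / cum_weights[-1]
        return float(np.interp(q, positions, values))
    raise ValueError(f"unknown interpolation: {interpolation!r}")


rng = np.random.RandomState(42)
X = np.sort(rng.randn(10))
w = np.ones_like(X)

linear_median = weighted_percentile_sketch(X, w, interpolation="linear")
lower_median = weighted_percentile_sketch(X, w, interpolation="lower")
print(linear_median, lower_median, np.median(X))
```

Making "linear" the default would align the unit-weight median with np.median, while "lower" would keep the behaviour reported in this issue available for callers that rely on it.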

glemaitre commented, May 28, 2020

ping @lucyleeow Since you have already looked at the code, you might have some intuition about why this is the case. We should actually add the above test as a regression test for our _weighted_percentile.