_weighted_percentile does not lead to the same result as np.median

While reviewing a test in https://github.com/scikit-learn/scikit-learn/pull/16937, it appeared that our implementation of _weighted_percentile with unit sample_weight leads to a different result than np.median, which is a bit problematic for consistency.

In gradient boosting, it breaks the loss equivalence because the initial predictions are different. We could bypass this issue by always computing the median using _weighted_percentile there.

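import pytest
import numpy as np
from sklearn.utils.stats import _weighted_percentile

# Ten sorted samples with unit weights: an even count, so the median
# falls between X[4] and X[5].
rng = np.random.RandomState(42)
X = rng.randn(10)
X.sort()
sample_weight = np.ones(X.shape)

median_numpy = np.median(X)
median_numpy_percentile = np.percentile(X, 50)
median_sklearn = _weighted_percentile(X, sample_weight, percentile=50.0)

# NumPy averages the two middle values.
assert median_numpy == pytest.approx(np.mean(X[[4, 5]]))
# _weighted_percentile returns the lower middle value instead.
assert median_sklearn == pytest.approx(X[4])
assert median_numpy == median_numpy_percentile
# This last assertion fails: it is the inconsistency reported here.
assert median_sklearn == pytest.approx(median_numpy)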
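
To make the source of the mismatch concrete, here is a minimal NumPy-only sketch. The helper `lower_weighted_percentile` is hypothetical and only assumes a cumulative-weight rule (pick the first sorted value whose cumulative weight reaches the target); it is not the scikit-learn implementation, just one rule that reproduces the behavior observed above.

```python
import numpy as np


def lower_weighted_percentile(values, weights, percentile=50.0):
    """Hypothetical helper (an assumption, not scikit-learn code).

    Returns the first sorted value whose cumulative weight reaches
    percentile / 100 of the total weight. No interpolation is done.
    """
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    cum_weights = np.cumsum(weights[order])
    target = percentile / 100.0 * cum_weights[-1]
    # Index of the first cumulative weight that is >= target.
    idx = np.searchsorted(cum_weights, target)
    idx = min(idx, len(values) - 1)
    return values[order[idx]]


x_even = np.arange(10, dtype=float)  # 0.0, 1.0, ..., 9.0
x_odd = np.arange(9, dtype=float)    # 0.0, 1.0, ..., 8.0

lower_even = lower_weighted_percentile(x_even, np.ones_like(x_even))
lower_odd = lower_weighted_percentile(x_odd, np.ones_like(x_odd))

print(lower_even, np.median(x_even))  # 4.0 4.5
print(lower_odd, np.median(x_odd))    # 4.0 4.0
```

Under this rule the two medians agree for an odd number of unit-weight samples and differ for an even number, because np.median averages the two middle values while the cumulative-weight rule always returns an element of the array.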

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 9 (9 by maintainers)

Top GitHub Comments

glemaitre commented, May 28, 2020

I think that our _weighted_percentile should offer the same default as np.percentile, and maybe an interpolation parameter if required.
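
As an illustration of what an interpolating variant could look like, here is a minimal NumPy-only sketch. The function name `interpolated_weighted_percentile` and the midpoint-based position rule are assumptions made for this example, not an existing scikit-learn API; the rule is chosen so that unit weights reproduce np.percentile's default linear interpolation. It assumes strictly positive weights.

```python
import numpy as np


def interpolated_weighted_percentile(values, weights, percentile=50.0):
    """Hypothetical sketch: weighted percentile with linear interpolation.

    Assumes strictly positive weights. With unit weights the result
    matches np.percentile's default (linear) behavior.
    """
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    if values.size == 1:
        return float(values[0])
    # Place each sample at the midpoint of its weight interval, then
    # rescale so the first sample sits at 0 and the last at 1.
    positions = np.cumsum(weights) - 0.5 * weights
    positions -= positions[0]
    positions /= positions[-1]
    return float(np.interp(percentile / 100.0, positions, values))


rng = np.random.RandomState(42)
x = rng.randn(10)
unit_weights = np.ones_like(x)

weighted_median = interpolated_weighted_percentile(x, unit_weights)
print(np.isclose(weighted_median, np.median(x)))  # True
```

With unit weights the positions reduce to k / (n - 1), which is the same grid np.percentile interpolates on by default. For non-uniform weights this is only one of several reasonable definitions.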

glemaitre commented, May 28, 2020

ping @lucyleeow Since you already looked at the code, you might have some intuition about why this is the case. We should actually have the above test as a regression test for our _weighted_percentile.
