_weighted_percentile does not lead to the same result as np.median
While reviewing a test in https://github.com/scikit-learn/scikit-learn/pull/16937, it appears that our implementation of _weighted_percentile with unit sample_weight leads to a different result than np.median, which is a bit problematic for consistency.
In gradient boosting, it breaks the loss equivalence because the initial predictions are different. We could bypass this issue by always computing the median using _weighted_percentile there.
import pytest
import numpy as np
from sklearn.utils.stats import _weighted_percentile

rng = np.random.RandomState(42)
X = rng.randn(10)
X.sort()
sample_weight = np.ones(X.shape)

median_numpy = np.median(X)
median_numpy_percentile = np.percentile(X, 50)
median_sklearn = _weighted_percentile(X, sample_weight, percentile=50.0)

# NumPy averages the two middle values of the 10 sorted samples
assert median_numpy == pytest.approx(np.mean(X[[4, 5]]))
# _weighted_percentile returns the lower of the two middle values
assert median_sklearn == pytest.approx(X[4])
assert median_numpy == median_numpy_percentile
# This last assertion fails, which is the inconsistency reported here
assert median_sklearn == pytest.approx(median_numpy)
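A minimal sketch of one way this gap can arise, assuming the weighted percentile returns the first sorted value whose cumulative weight reaches the requested fraction of the total weight. `lower_weighted_percentile` is a hypothetical helper written for illustration, not scikit-learn's code; the assumption is simply consistent with the `X[4]` result asserted above.

```python
import numpy as np


def lower_weighted_percentile(values, weights, percentile=50.0):
    """Hypothetical helper (assumed rule, not scikit-learn's implementation):
    return the first sorted value whose cumulative weight reaches the target."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    cum_weights = np.cumsum(weights[order])
    target = percentile / 100.0 * cum_weights[-1]
    # side="left" (the default): first index with cum_weights[idx] >= target
    idx = np.searchsorted(cum_weights, target)
    idx = min(idx, len(values) - 1)  # guard against rounding past the end
    return values[order][idx]


rng = np.random.RandomState(42)
X = np.sort(rng.randn(10))
unit_weights = np.ones_like(X)

lower = lower_weighted_percentile(X, unit_weights)
numpy_median = np.median(X)
```

With ten unit weights the cumulative weights are 1 through 10 and the target is 5, so the rule stops at index 4, whereas `np.median` averages the values at indices 4 and 5.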
Issue Analytics
- State:
- Created 3 years ago
- Comments:9 (9 by maintainers)
Top GitHub Comments
I think that our _weighted_percentile should offer the same default and maybe an interpolation parameter if required.
ping @lucyleeow Since you already looked at the code, you might have some intuition why this is the case. We should actually have the above test as a regression test for our _weighted_percentile
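As a rough illustration of what an averaging default could look like, here is a sketch that averages a "lower" and an "upper" weighted percentile, so that unit weights reproduce `np.median`. The names `_lower_percentile` and `averaged_weighted_percentile` are hypothetical, and the cumulative-weight rule is an assumption; this is not an existing scikit-learn API or the fix adopted by the maintainers.

```python
import numpy as np


def _lower_percentile(values, weights, percentile):
    # Assumed rule: first sorted value whose cumulative weight reaches the target.
    order = np.argsort(values)
    cum_weights = np.cumsum(weights[order])
    idx = np.searchsorted(cum_weights, percentile / 100.0 * cum_weights[-1])
    return values[order][min(idx, len(values) - 1)]


def averaged_weighted_percentile(values, weights, percentile=50.0):
    """Hypothetical sketch: mean of the lower and upper weighted percentiles."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    lower = _lower_percentile(values, weights, percentile)
    # The upper percentile is the lower one taken from the other end:
    # negate the values and use the complementary percentile.
    upper = -_lower_percentile(-values, weights, 100.0 - percentile)
    return 0.5 * (lower + upper)


rng = np.random.RandomState(42)
X_even = rng.randn(10)
X_odd = rng.randn(9)

median_even = averaged_weighted_percentile(X_even, np.ones(10))
median_odd = averaged_weighted_percentile(X_odd, np.ones(9))
```

For an even number of unit-weight samples the lower and upper values are the two middle elements, and for an odd number they coincide, so both cases agree with `np.median`. Whether averaging or a configurable interpolation is the better default is left open in the discussion above.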