Proposal: truncated mean function on qdigest
I'd like to propose a new scalar function on the qdigest type. The quantile digest is used internally to compute approximate percentiles and was recently exposed as an aggregate function in its own right.
Currently the only scalar function on qdigest is value_at_quantile(qdigest(T), quantile) → T
which extracts the approx percentile from the q-digest. The new function I require would have the signature
truncated_mean(qdigest(T), lower_quantile, upper_quantile) -> double
i.e. take a q-digest and a quantile range (e.g. 0.1 -> 0.9) and return the approx mean of values which fall in that range of the distribution. This is commonly known as the truncated or trimmed mean.
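To make the definition concrete, here is a minimal reference sketch in Python that computes the exact truncated mean from raw values, with no q-digest involved. The rank convention (keep sorted positions from floor(lower * n) up to ceil(upper * n)) is an assumption made for illustration; other conventions exist, and the proposed function would only approximate this result.

```python
import math


def truncated_mean(values, lower_quantile, upper_quantile):
    """Exact mean of the values whose rank lies in the given quantile range.

    Assumed convention: keep sorted positions floor(lower * n) up to,
    but not including, ceil(upper * n).
    """
    if not 0.0 <= lower_quantile < upper_quantile <= 1.0:
        raise ValueError("need 0 <= lower_quantile < upper_quantile <= 1")
    ordered = sorted(values)
    n = len(ordered)
    start = math.floor(lower_quantile * n)
    end = math.ceil(upper_quantile * n)
    kept = ordered[start:end]
    if not kept:
        raise ValueError("no values fall in the requested quantile range")
    return sum(kept) / len(kept)


# Ten values with one extreme outlier in the upper tail.
samples = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]
plain_mean = sum(samples) / len(samples)       # dominated by the outlier
p90_mean = truncated_mean(samples, 0.0, 0.9)   # outlier is trimmed away
```

With this sample the plain mean is pulled far above every typical value by the single outlier, while the 0 -> 0.9 truncated mean stays within the bulk of the data.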
I had a look at airlift, the library Presto uses for working with qdigest, and I think it already exposes a function which could do this: https://github.com/airlift/airlift/blob/master/stats/src/main/java/io/airlift/stats/QuantileDigest.java#L443 (you can create a histogram over a qdigest; this yields Bucket objects which have a method to compute the mean). The linked function has notes on the error bounds of this approach.
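As a rough sketch of how the histogram approach could yield a truncated mean: if the histogram is cut at the values found at the lower and upper quantiles, the buckets between those cuts each carry an approximate count and mean, and their count-weighted average is the approximate truncated mean. The Bucket type below is a hypothetical Python stand-in, not airlift's actual class, and the numbers are invented for illustration.

```python
from typing import NamedTuple, Sequence


class Bucket(NamedTuple):
    """Hypothetical stand-in for a histogram bucket over a q-digest."""
    count: float  # approximate number of values in the bucket
    mean: float   # approximate mean of the values in the bucket


def mean_over_buckets(buckets: Sequence[Bucket]) -> float:
    """Count-weighted mean of the bucket means."""
    total = sum(b.count for b in buckets)
    if total == 0:
        raise ValueError("buckets contain no values")
    return sum(b.count * b.mean for b in buckets) / total


# Illustrative buckets covering only the requested quantile range,
# i.e. the tails below and above the range are already excluded.
in_range = [Bucket(count=2, mean=1.0), Bucket(count=6, mean=5.0)]
approx_truncated_mean = mean_over_buckets(in_range)
```

If the histogram is requested with a single bucket spanning the whole range, that bucket's mean is already the answer and no weighting is needed.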
I would be happy to implement this, but I would first like to know if this new function would be accepted, and if the approach is sensible.
Thanks
Top GitHub Comments
@blrnw3 Ben, I'm thinking of adding a function that computes an approximate truncated mean from raw values. The implementation would build the qdigest and use it to compute the truncated mean, just as the function you suggested would.
One of the top use cases for my team is p90 truncated means, i.e. the mean of values in the 0 -> 0.9 quantile range. We are tracking a number of metrics with highly skewed distributions, e.g. message send time, latency etc. For these metrics we report a number of percentile values, e.g. p50, p90, p99, but also want a feel for the mean value. However, the upper tail is so extreme that the normal mean is useless, so we also calculate the p90 truncated mean, which is a valuable addition to the percentile values.
The additional complexity is that we calculate these statistics across a large number of cuts of the data, requiring a large number of queries. Exposing truncated means through qdigest would be a significant optimisation of our pipelines.