Proposal: truncated mean function on qdigest

I’d like to propose a new scalar function on the qdigest type. The quantile digest is used internally to compute approx percentiles and was recently exposed as an aggregate function in its own right.

Currently the only scalar function on qdigest is value_at_quantile(qdigest(T), quantile) → T, which extracts the approximate percentile from the q-digest. The function I am proposing would have the signature truncated_mean(qdigest(T), lower_quantile, upper_quantile) → double, i.e. it would take a q-digest and a quantile range (e.g. 0.1 to 0.9) and return the approximate mean of the values that fall in that range of the distribution. This is commonly known as the truncated or trimmed mean.
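To make the intended semantics concrete, here is a minimal Python sketch of an exact truncated mean over raw values. It is not the q-digest implementation and says nothing about how Presto would compute it; the rank rule used to map quantiles to positions (simple truncation of quantile × n) is an assumption chosen for illustration.

```python
from typing import Optional, Sequence


def truncated_mean(
    values: Sequence[float], lower_quantile: float, upper_quantile: float
) -> Optional[float]:
    """Exact mean of the values whose rank falls in [lower, upper)."""
    if not 0.0 <= lower_quantile < upper_quantile <= 1.0:
        raise ValueError("expected 0 <= lower_quantile < upper_quantile <= 1")

    ordered = sorted(values)
    n = len(ordered)

    # Assumption: quantiles map to positions by truncating quantile * n.
    lo = int(lower_quantile * n)
    hi = int(upper_quantile * n)

    kept = ordered[lo:hi]
    if not kept:
        return None  # nothing falls inside the requested range
    return sum(kept) / len(kept)


# A skewed sample: one extreme value dominates the ordinary mean.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]
full_mean = truncated_mean(sample, 0.0, 1.0)    # ordinary mean
p90_trimmed = truncated_mean(sample, 0.0, 0.9)  # drops the top 10%
```

With this sample the ordinary mean is pulled far above every typical value by the single outlier, while the 0 to 0.9 truncated mean stays within the bulk of the data.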

I had a look at airlift, the library Presto uses for working with qdigest, and I think it already exposes a function that could do this: https://github.com/airlift/airlift/blob/master/stats/src/main/java/io/airlift/stats/QuantileDigest.java#L443 (you can create a histogram over a qdigest; this yields Bucket objects, which have a method to compute the mean). The linked function has notes on the error bounds of this approach.
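The histogram idea can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Bucket` class and `histogram` function below are hypothetical stand-ins that operate on exact sorted values, not the airlift API, and the half-open bucket boundaries are an assumption. In the proposed function the two bounds would presumably come from the digest's values at the lower and upper quantiles, and the bucket mean would be approximate.

```python
import bisect
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Bucket:
    """Hypothetical stand-in for a histogram bucket: a count and a mean."""
    count: int
    mean: float  # 0.0 when the bucket is empty


def histogram(sorted_values: Sequence[float], upper_bounds: Sequence[float]) -> List[Bucket]:
    """Split sorted values into consecutive buckets [previous_bound, bound)."""
    buckets = []
    start = 0
    for bound in upper_bounds:
        # First index at or after `start` whose value is >= bound.
        end = bisect.bisect_left(sorted_values, bound, lo=start)
        chunk = sorted_values[start:end]
        mean = sum(chunk) / len(chunk) if chunk else 0.0
        buckets.append(Bucket(len(chunk), mean))
        start = end
    return buckets


def truncated_mean_from_histogram(
    values: Sequence[float], lower_value: float, upper_value: float
) -> Optional[float]:
    """Mean of values in [lower_value, upper_value), read off a two-bucket histogram."""
    ordered = sorted(values)
    # Bucket 0 holds everything below lower_value; bucket 1 is the range we want.
    _below, inside = histogram(ordered, [lower_value, upper_value])
    return inside.mean if inside.count else None
```

For example, `truncated_mean_from_histogram([1, 2, 3, 4, 100], 2, 100)` averages only 2, 3 and 4, because the bucket excludes its upper bound.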

I would be happy to implement this, but I would first like to know if this new function would be accepted, and if the approach is sensible.

Thanks

Issue Analytics

Created 5 years ago
Reactions: 4
Comments: 7 (6 by maintainers)

Top GitHub Comments

1 reaction

mbasmanova commented, Jan 22, 2019

@blrnw3 Ben, I’m thinking of adding a function that computes an approximate truncated mean from raw values. The implementation would build the qdigest and use it to compute the truncated mean, just as the function you suggested would.

truncated_mean(value, lower_percentile, upper_percentile)
truncated_mean(qdigest, lower_percentile, upper_percentile)

1 reaction

blrnw3 commented, Jan 21, 2019

One of the top use cases for my team is p90 truncated means, i.e. the mean of the values in the 0 to 0.9 quantile range. We track a number of metrics with highly skewed distributions, e.g. message send time and latency. For these metrics we report several percentile values, e.g. p50, p90 and p99, but we also want a feel for the mean value. However, the upper tail is so extreme that the ordinary mean is useless, so we also calculate the p90 truncated mean, which is a valuable addition to the percentile values.

The additional complexity is that we calculate these statistics across a large number of cuts of the data, which requires a large number of queries. Exposing truncated means through qdigest would be a significant optimisation of our pipelines.
