Proposal: truncated mean function on qdigest

I’d like to propose a new scalar function on the qdigest type. The quantile digest is used internally to compute approx percentiles and was recently exposed as an aggregate function in its own right.

Currently the only scalar function on qdigest is value_at_quantile(qdigest(T), quantile) -> T, which extracts the approximate percentile from the q-digest. The new function I require would have the signature truncated_mean(qdigest(T), lower_quantile, upper_quantile) -> double, i.e. it would take a q-digest and a quantile range (e.g. 0.1 -> 0.9) and return the approximate mean of the values that fall in that range of the distribution. This is commonly known as the truncated or trimmed mean.
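To make the definition concrete, here is a minimal Python sketch of an exact truncated mean over raw values. It is not the proposed Presto function and does not use a q-digest; the function name and the rank rounding rule (floor for the lower cut, ceiling for the upper cut) are assumptions made for illustration only.

```python
import math


def truncated_mean(values, lower_quantile, upper_quantile):
    """Exact mean of the values whose rank lies in the given quantile range."""
    if not 0.0 <= lower_quantile < upper_quantile <= 1.0:
        raise ValueError("expected 0 <= lower_quantile < upper_quantile <= 1")
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("values must not be empty")
    # Round before floor/ceil so float noise such as 7.000000000000001
    # does not shift a cut point by one element.
    lo = math.floor(round(lower_quantile * n, 9))
    hi = math.ceil(round(upper_quantile * n, 9))
    kept = ordered[lo:hi]
    return sum(kept) / len(kept)


# Drop the lowest and highest 10% of ten values, then average the rest.
print(truncated_mean(range(1, 11), 0.1, 0.9))
```

A q-digest based version would return an approximation of this quantity without needing to keep or sort the raw values.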

I had a look at airlift, the library Presto uses for working with qdigest, and I think it already exposes a function which could do this: https://github.com/airlift/airlift/blob/master/stats/src/main/java/io/airlift/stats/QuantileDigest.java#L443 (you can create a histogram over a qdigest; this yields Bucket objects which have a method to compute the mean). The linked function has notes on the error bounds of this approach.
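The histogram idea above can be sketched in Python as follows. This does not call airlift: `Bucket`, `histogram` and `value_at_quantile` are hypothetical stand-ins that work on a sorted list of raw values, and the bucket convention (each bucket holds values below its upper bound and not below the previous bound) is an assumption of this sketch. A real q-digest would return approximate counts and means instead of exact ones.

```python
from bisect import bisect_left
from dataclasses import dataclass


@dataclass
class Bucket:
    count: int
    mean: float


def histogram(sorted_values, upper_bounds):
    """One Bucket per upper bound: values below it and not below the previous bound."""
    buckets = []
    start = 0
    for bound in upper_bounds:
        end = bisect_left(sorted_values, bound)  # first index with value >= bound
        chunk = sorted_values[start:end]
        mean = sum(chunk) / len(chunk) if chunk else float("nan")
        buckets.append(Bucket(len(chunk), mean))
        start = end
    return buckets


def value_at_quantile(sorted_values, quantile):
    """Stand-in for the quantile lookup a digest would provide."""
    index = min(int(quantile * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[index]


def truncated_mean_from_histogram(values, lower_quantile, upper_quantile):
    ordered = sorted(values)
    lower_value = value_at_quantile(ordered, lower_quantile)
    upper_value = value_at_quantile(ordered, upper_quantile)
    # Bucket 0 holds values < lower_value; bucket 1 holds
    # lower_value <= value < upper_value, which is the range we want.
    _, middle = histogram(ordered, [lower_value, upper_value])
    return middle.mean


print(truncated_mean_from_histogram(list(range(1, 11)), 0.1, 0.9))
```

Because the upper bound is exclusive in this sketch, a value exactly equal to the upper quantile value is left out; a real implementation would need to decide how to treat that boundary.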

I would be happy to implement this, but I would first like to know if this new function would be accepted, and if the approach is sensible.

Thanks

Issue Analytics

  • State: open
  • Created: 5 years ago
  • Reactions: 4
  • Comments: 7 (6 by maintainers)

Top GitHub Comments

mbasmanova commented, Jan 22, 2019

@blrnw3 Ben, I’m thinking of adding a function that computes an approximate truncated mean from raw values. The implementation would build the qdigest and use it to compute truncated mean just like the function you suggested would.

  • truncated_mean(value, lower_percentile, upper_percentile)
  • truncated_mean(qdigest, lower_percentile, upper_percentile)

blrnw3 commented, Jan 21, 2019

One of the top use cases for my team is p90 truncated means, i.e. the mean of values in the 0 -> 0.9 quantile range. We are tracking a number of metrics with highly skewed distributions, e.g. message send time, latency etc. For these metrics we report a number of percentile values, e.g. p50, p90, p99, but also want a feel for the mean value. However, the upper tail is so extreme that the normal mean is useless, so we also calculate the p90 truncated mean, which is a valuable addition to the percentile values.

The additional complexity is that we calculate these statistics across a large number of cuts of the data, requiring a large number of queries. Exposing truncated means through qdigest would be a significant optimisation of our pipelines.
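One way to read the pipeline point above is that a mergeable summary can be built once per fine-grained cut and then combined for coarser cuts, instead of rescanning the raw data for each query. The Python sketch below illustrates that pattern with a simple fixed-width binned summary, which is only a stand-in for a q-digest. All names and the bin width are hypothetical, and treating every value in a bin as equal to the bin mean is an assumption of this sketch, not how a q-digest bounds its error.

```python
def summarize(values, bin_width=10):
    """Per-cut summary: bin index -> (count, sum of values)."""
    summary = {}
    for v in values:
        key = int(v // bin_width)
        count, total = summary.get(key, (0, 0.0))
        summary[key] = (count + 1, total + v)
    return summary


def merge(a, b):
    """Combine two summaries by adding counts and sums bin by bin."""
    merged = dict(a)
    for key, (count, total) in b.items():
        c0, t0 = merged.get(key, (0, 0.0))
        merged[key] = (c0 + count, t0 + total)
    return merged


def truncated_mean_from_summary(summary, lower_quantile, upper_quantile):
    """Approximate truncated mean, treating each bin's values as the bin mean."""
    n = sum(count for count, _ in summary.values())
    lo_rank, hi_rank = lower_quantile * n, upper_quantile * n
    seen = 0
    kept_count = kept_sum = 0.0
    for key in sorted(summary):
        count, total = summary[key]
        # Number of ranks in this bin that overlap the requested rank range.
        overlap = max(0.0, min(seen + count, hi_rank) - max(seen, lo_rank))
        kept_count += overlap
        kept_sum += overlap * (total / count)
        seen += count
    return kept_sum / kept_count


# Two cuts summarized separately, then merged for a combined p90 truncated mean.
cut_a = summarize(range(1, 51))
cut_b = summarize(range(51, 101))
print(truncated_mean_from_summary(merge(cut_a, cut_b), 0.0, 0.9))
```

The exact p90 truncated mean of the values 1 to 100 is 45.5; the binned summary gets close to it, and the gap depends on the bin width.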
