Improve quantile estimation accuracy with T-Digest
The underlying algorithm behind approx_percentile uses q-digest, which ensures uniform error across the distribution. Empirically, it has been shown that using a biased estimator like t-digest can improve accuracy for common quantile queries at the tails with a nominal or no decrease in accuracy at the median.
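As background, t-digest (Dunning) achieves this bias with a scale function: a centroid may only cover a quantile range over which the function changes by at most 1, and because the function is steep near q = 0 and q = 1, tail centroids stay tiny while centroids near the median can absorb many points. Below is a minimal, illustrative sketch of the k_1 scale function; the compression value and the coarse numeric scan are arbitrary choices for the demo, not Presto code.

```java
// Illustrative sketch of the t-digest k_1 scale function (not Presto code).
// A centroid covering quantiles [qLo, qHi] is only allowed if
// k(qHi) - k(qLo) <= 1, so resolution is concentrated at the tails.
public class ScaleFunctionDemo
{
    static double k1(double q, double compression)
    {
        return compression / (2 * Math.PI) * Math.asin(2 * q - 1);
    }

    public static void main(String[] args)
    {
        double compression = 100; // the "compression factor" discussed below
        for (double q : new double[] {0.001, 0.01, 0.5, 0.99, 0.999}) {
            // Scan forward to find roughly how wide a centroid starting at q may be.
            double width = 0;
            while (q + width < 1 && k1(q + width, compression) - k1(q, compression) <= 1) {
                width += 1e-6;
            }
            System.out.printf("q = %.3f -> max centroid width ~ %.6f%n", q, width);
        }
    }
}
```

Running this shows the allowed centroid width shrinking sharply as q approaches 0 or 1, which is where the extra tail accuracy comes from.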
We would like to investigate implementing t-digest in Presto to improve:
- Quantile queries at the high tails, such as P999 queries.
- Compute efficiency for common quantile queries. T-Digest has a smaller in-memory footprint, should require less computation to merge and serialize, and has lower overhead for adding new entries to the distribution.
To implement this, we’d like to add the T-Digest algorithm directly in Presto. Once we’re happy with the implementation, we can proceed to:
- Add a new T-Digest type (similar to Q-Digest). This will include:
  - Serialization/deserialization
  - Creation aggregation functions
  - Merging aggregation function
  - The ability to cast to VARBINARY for serialization
- Use T-Digest for approx_percentile over DOUBLE
  - This will be a change in behavior that will initially be gated by a system property
  - There will be versions which internally use T-Digest, which will be used when the property is enabled
  - This will not be used for the overload of approx_percentile which explicitly specifies the accuracy, as that accuracy could no longer be guaranteed
- Add a truncated mean function (see the sketch after this list for the underlying idea)
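As an illustration of the last item, a truncated (trimmed) mean can be computed from a t-digest by clipping each centroid's weight to the requested quantile band and taking the weighted average of what remains. The sketch below is a hypothetical illustration over plain mean/weight arrays (the class name, inputs, and example values are made up), not the eventual Presto function.

```java
// Hypothetical illustration of a truncated (trimmed) mean over t-digest
// centroids, given as parallel arrays of means and weights sorted by mean.
public class TruncatedMeanSketch
{
    static double truncatedMean(double[] means, double[] weights, double lowerQuantile, double upperQuantile)
    {
        double total = 0;
        for (double w : weights) {
            total += w;
        }
        double lowerRank = lowerQuantile * total;
        double upperRank = upperQuantile * total;

        double sum = 0;
        double kept = 0;
        double seen = 0;
        for (int i = 0; i < means.length; i++) {
            double start = seen;
            double end = seen + weights[i];
            // Clip this centroid's weight to the [lowerRank, upperRank] band.
            double overlap = Math.min(end, upperRank) - Math.max(start, lowerRank);
            if (overlap > 0) {
                sum += means[i] * overlap;
                kept += overlap;
            }
            seen = end;
        }
        return kept > 0 ? sum / kept : Double.NaN;
    }

    public static void main(String[] args)
    {
        double[] means = {1, 2, 3, 100};   // one large outlier centroid
        double[] weights = {10, 10, 10, 1};
        // Trimming the top/bottom 5% discards most of the outlier's influence:
        // prints roughly 2.0 instead of the untrimmed mean of ~5.2.
        System.out.println(truncatedMean(means, weights, 0.05, 0.95));
    }
}
```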
Longer term, we can investigate some other things:
- Adding support for more types
- Migrating internal usage of q-digest for stats collection to t-digest
- An implementation which grows O(log(N)) with the data distribution but ensures a proven accuracy bound, for cases where a minimum uniform accuracy is required
There are several open questions which need to be answered for us to proceed:
- What precisely are the performance benefits for adding, merging, serializing, deserializing and querying, relative to q-digest?
- How much relative storage can we expect to save by using t-digest?
- What is a sensible compression factor to use as a default, which gives us median accuracy at least as good as q-digest (i.e. a compression factor which results in error <= +/- 1% of the rank of the true quantile value in the sorted set of values)?
- What is the impact on accuracy when merging? Would small splits negatively impact the accuracy to such a degree we degrade on the benchmark above?
- Can we represent the t-digest using flat data structures and constant memory (to minimize overhead and reduce GC)? One candidate flat layout is sketched after this list.
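On the serialization and flat-representation questions, one purely hypothetical flat layout (not Presto's actual VARBINARY format; the header fields and their ordering here are assumptions) is a small fixed header followed by the centroid arrays, which also gives a back-of-envelope feel for serialized size at a compression factor of 100.

```java
import java.nio.ByteBuffer;

// Hypothetical flat layout for a serialized t-digest (not Presto's actual
// VARBINARY format): header fields followed by the centroid mean/weight arrays.
public class TDigestSerializationSketch
{
    static byte[] serialize(double compression, double min, double max, double[] means, double[] weights)
    {
        int numCentroids = means.length;
        ByteBuffer buffer = ByteBuffer.allocate(4 + 3 * 8 + numCentroids * 2 * 8);
        buffer.putInt(numCentroids);
        buffer.putDouble(compression);
        buffer.putDouble(min);
        buffer.putDouble(max);
        for (int i = 0; i < numCentroids; i++) {
            buffer.putDouble(means[i]);
            buffer.putDouble(weights[i]);
        }
        return buffer.array();
    }

    public static void main(String[] args)
    {
        // With a compression factor of 100 there are on the order of a hundred
        // centroids, so the serialized size stays in the low kilobytes
        // regardless of how many values were added.
        double[] means = new double[100];
        double[] weights = new double[100];
        byte[] bytes = serialize(100, 0, 1, means, weights);
        System.out.println("serialized size: " + bytes.length + " bytes");
    }
}
```

With about 100 centroids this comes out to roughly 1.6 KB plus a small header, illustrating why serialized t-digests stay small no matter how much data was fed in.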
Top GitHub Comments
T-Digest Benchmarks
Over the past few weeks, I have been working and experimenting with the t-digest data structure to potentially replace Q-Digest in Presto. T-Digest is made up of a collection of centroids, each associated with a mean and a weight (number of elements in the centroid), ordered by ascending mean. This digest allows you to collect specific stats from a data set using very limited space and time. One of the main uses for t-digest is the ability to calculate quantiles more efficiently. Below, you will be able to find data on different benchmarks to compare the relative performance of T-Digest to Q-Digest on varying distributions.
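For readers unfamiliar with the structure, here is a minimal sketch of that representation (my own illustration, not the Presto or t-digest library code): parallel mean/weight arrays ordered by ascending mean, with a quantile estimated by interpolating between the two centroids that bracket the target rank.

```java
// Minimal illustration of a t-digest's contents (not the Presto implementation):
// parallel arrays of centroid means and weights, ordered by ascending mean.
public class CentroidQuantileSketch
{
    // Estimate the value at quantile q by walking the cumulative weight and
    // interpolating between the two centroids that bracket the target rank.
    static double quantile(double[] means, double[] weights, double q)
    {
        double total = 0;
        for (double w : weights) {
            total += w;
        }
        double targetRank = q * total;

        double cumulative = 0;
        for (int i = 0; i < means.length; i++) {
            double center = cumulative + weights[i] / 2;
            if (targetRank <= center) {
                if (i == 0) {
                    return means[0]; // below the first centroid's center
                }
                double previousCenter = cumulative - weights[i - 1] / 2;
                double fraction = (targetRank - previousCenter) / (center - previousCenter);
                return means[i - 1] + fraction * (means[i] - means[i - 1]);
            }
            cumulative += weights[i];
        }
        return means[means.length - 1]; // above the last centroid's center
    }

    public static void main(String[] args)
    {
        double[] means = {1, 5, 9, 13};
        double[] weights = {25, 25, 25, 25};
        System.out.println(quantile(means, weights, 0.5));  // 7.0
        System.out.println(quantile(means, weights, 0.99)); // 13.0 (clamped to the last centroid)
    }
}
```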
Unless otherwise stated, all data used was created by taking 1,000,000 samples from a normal distribution. The T-Digests used were built using a compression factor of 100, while Q-Digests were built using 0.01 as the maximum error parameter.
Time
One of the main advantages of t-digest is that it's very fast, allowing you to build and perform operations on this data structure in a few thousand nanoseconds. Here are some of the runtimes of common digest operations:
As seen in the graph above, t-digest outperforms q-digest in every single operation measured. It is worth noting serialization time, which is about 70x faster since we need to store significantly fewer bytes of data (more on this below). This overall increase in performance makes sense given the underlying data structures used to represent each digest. Q-digest uses a tree, so we need to traverse it for every operation. Meanwhile, t-digest simply uses a collection of centroids, whose size is usually no larger than the compression factor. We can perform each operation faster because the collection is indexed, allowing us to get to the specific centroid we want instead of having to iterate through all centroids. While testing the merge operation, I observed that the runtime is proportional to the compression factor, so merging thousands of digests with 1,000 or 10,000 as the compression factor can take a long time.
I also tested how the time to insert each element changes as the total number of elements being added increases. I expected it to remain somewhat constant, since the number of centroids should be relatively constant and shouldn’t exceed the compression factor. The results can be seen in the graph below.
As expected, we can see that once we reach 10,000+ elements, the time per insertion remains constant at around 65 ns/insertion. In addition, we can see that each insertion into a t-digest is about 3x faster than inserting into q-digest, which is important when building a digest to analyze massive amounts of data.
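The roughly constant per-insertion cost comes from buffering: new values are appended to a small buffer, and only when the buffer fills is everything re-clustered back down to a bounded number of centroids. The sketch below illustrates that pattern with a deliberately simplified re-clustering rule (a real t-digest sizes clusters with its scale function); all names and constants are illustrative.

```java
import java.util.Arrays;

// Simplified illustration of amortized-constant insertion (the re-clustering
// rule is simplified; a real t-digest uses its scale function to size clusters):
// values are appended to a buffer, and only an occasional compression pass
// re-groups everything into a bounded number of centroids.
public class BufferedInsertSketch
{
    private static final int BUFFER_SIZE = 1024;
    private static final int MAX_CENTROIDS = 100; // plays the role of the compression factor

    private double[] centroidMeans = new double[0];
    private double[] centroidWeights = new double[0];
    private final double[] buffer = new double[BUFFER_SIZE];
    private int buffered;

    public void add(double value)
    {
        buffer[buffered++] = value; // the common, cheap path
        if (buffered == BUFFER_SIZE) {
            compress();             // occasional pass bounded by buffer size + centroid count
        }
    }

    private void compress()
    {
        // Expand existing centroids and buffered values into (value, weight) pairs.
        int n = centroidMeans.length + buffered;
        if (n == 0) {
            return;
        }
        double[][] points = new double[n][2];
        for (int i = 0; i < centroidMeans.length; i++) {
            points[i][0] = centroidMeans[i];
            points[i][1] = centroidWeights[i];
        }
        for (int i = 0; i < buffered; i++) {
            points[centroidMeans.length + i][0] = buffer[i];
            points[centroidMeans.length + i][1] = 1;
        }
        Arrays.sort(points, (a, b) -> Double.compare(a[0], b[0]));

        double totalWeight = 0;
        for (double[] point : points) {
            totalWeight += point[1];
        }
        double maxCentroidWeight = totalWeight / MAX_CENTROIDS;

        // Greedily merge consecutive points into centroids of bounded weight,
        // tracking each centroid's weighted mean.
        double[] newMeans = new double[n];
        double[] newWeights = new double[n];
        int count = 0;
        double sum = 0;
        double weight = 0;
        for (double[] point : points) {
            if (weight > 0 && weight + point[1] > maxCentroidWeight) {
                newMeans[count] = sum / weight;
                newWeights[count] = weight;
                count++;
                sum = 0;
                weight = 0;
            }
            sum += point[0] * point[1];
            weight += point[1];
        }
        newMeans[count] = sum / weight;
        newWeights[count] = weight;
        count++;

        centroidMeans = Arrays.copyOf(newMeans, count);
        centroidWeights = Arrays.copyOf(newWeights, count);
        buffered = 0;
    }

    public static void main(String[] args)
    {
        BufferedInsertSketch digest = new BufferedInsertSketch();
        java.util.Random random = new java.util.Random(42);
        for (int i = 0; i < 1_000_000; i++) {
            digest.add(random.nextGaussian()); // per-call cost stays roughly flat
        }
        digest.compress(); // flush whatever is left in the buffer
        System.out.println("centroids kept: " + digest.centroidMeans.length);
    }
}
```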
Storage
Another quality of t-digest is that it uses very little space to store information on the data that is added to it. This is excellent because one of the potential uses of t-digest could be saving t-digests and coming back to them in the future. Therefore, if we can minimize how many bytes we need to store a digest, we will be able to store many more digests while using less memory. Below you will find comparisons of the number of bytes needed to serialize a t-digest versus a q-digest for some of the most common distributions.
In the graph above, we can see the storage performance between q-digest and t-digest. Q-digest uses less space when the data is compacted rather than spread out. On the other hand, we can see how t-digest uses a similar amount of memory space regardless of the distribution or how spread out the values are. Therefore, we can see a massive difference when comparing performance over uniform distributions, where q-digest consumes over 100x more memory (graph is not scaled for that distribution). Even when the data is compact and q-digest can be stored with fewer bytes, the savings are almost negligible.
Accuracy
While runtime and memory usage can be good benchmarks, we must test accuracy to ensure that t-digest is really worth implementing over q-digest. To do this, I used samples from the same distributions as above to test the accuracy of t-digest when retrieving quantiles. I used varying compression factors for t-digest to determine which alternative offers the best space-accuracy benefits. In addition, I tested the effects of merging multiple distributions together, which appears to have no significant impact on accuracy. Below, you will find the accuracy plots for normal and uniform distributions. The remaining distributions were tested using Java unit tests, and all of them passed with less than 1% error.
As seen in the graphs above, t-digest returns better accuracy for almost all data points. In addition, we can see how the accuracy of the t-digest increases as we increase the compression factor. For general purposes, it makes sense to use 100 as a standard compression factor, but if extremely high accuracy is required, it might be worth using a higher compression factor (which will take up more space in memory).
Conclusions
Overall, I believe t-digest is definitely worth implementing in Presto. It performs better than q-digest for most data points across all benchmarks. In terms of runtime, every operation performed on t-digest takes significantly less time compared to q-digest (at least 3x faster for each metric). In addition, when distributions are sparse, t-digest offers massive storage savings, sometimes using up to 100x fewer bytes to store the data. Finally, in terms of accuracy, t-digest once again outperforms q-digest, especially when retrieving quantiles at the tails. Therefore, I think t-digest will be an excellent addition to Presto, with promising improvements for all functions that currently use q-digest.
T-Digest
Over the past 12 weeks, I have been working and experimenting with the t-digest data structure to potentially replace Quantile Digest in Presto. T-Digest is made up of a collection of centroids, each associated with a mean and a weight (number of elements in the centroid), ordered by ascending mean. This digest allows you to collect specific stats from a data set using very limited space and time. One of the main uses for t-digest is the ability to calculate quantiles more efficiently. Below, you will be able to find data on different benchmarks to compare the relative performance of T-Digest to Q-Digest on varying distributions.
Unless otherwise stated, all data used was created by aggregating across the totalprice column in tpch.sf2.orders. The T-Digest structures were built using a compression factor of 100, while Q-Digest was built using 0.01 as the maximum error parameter.
Time
One of the main advantages of t-digest is that it’s very fast, allowing you to build and perform operations on this data structure in a few thousand nanoseconds. Here are some of the runtimes of common digest operations, which I collected earlier during my internship.
While these time differences are very significant, I believe the most important difference which will impact our users is insertion speed. Whether we're using tdigest_agg or approx_percentile, we need to insert each element from the column into a digest. Therefore, this is a key improvement in terms of query speed. The original benchmark can be seen in the graph below.
Based on the benchmark, we can see that once we reach 10,000+ elements, the time per insertion remains constant at around 65 ns/insertion. In addition, we can see that each insertion into a t-digest is about 3x faster than inserting into q-digest, which is important when building a digest to analyze massive amounts of data. I've been able to see this 3x improvement consistently while running queries directly on the Presto engine, which shows that our initial benchmarks were a great estimate of the actual increase in query speed. A summary of the results for these benchmarks can be seen below:
Storage
Another quality of t-digest is that it uses very little space to store information on the data that is added to it. This is excellent because one of the potential uses of t-digest could be saving t-digests and coming back to them in the future. Therefore, if we can minimize how many bytes we need to store a digest, we will be able to store many more digests while using less memory. Below you will find comparisons of the number of bytes required to serialize a t-digest versus a q-digest for some of the most common distributions.
In the graph above, we can see the storage performance between q-digest and t-digest. Q-digest uses less space when the data is compacted rather than spread out. On the other hand, we can see how t-digest uses a similar amount of memory space regardless of the distribution or how spread out the values are. Therefore, we can see a massive difference when comparing performance over uniform distributions, where q-digest consumes over 100x more memory (graph is not scaled for that distribution). Even when the data is compact and q-digest can be stored with fewer bytes, the savings offered by q-digest are almost negligible.
When I originally integrated t-digest, every instance we initialized would create two arrays for the centroids (mean and weight arrays) and three arrays which acted as buffers. While testing the approx_percentile function and grouping by different keys, I realized that many groups would initialize an empty t-digest but have no elements to add to it. However, it was taking up the same space in memory as a t-digest with millions of elements. This was causing my local heap to blow up when I had 300,000 or more groups, which is relatively small compared to the number of groups that a query might have at Facebook (sometimes in the billions). Therefore, I added an optimization to the original t-digest structure, where the five arrays are dynamically sized. Based on this optimization, an empty t-digest now takes up 220 bytes in memory, whereas previously it consumed about 17,000 bytes of memory (just like a full t-digest). In comparison, an empty q-digest takes up about 210 bytes, while a populated one (with data from totalprice from tpch.sf2.orders) consumes about 200,000 bytes of memory. This value, however, shouldn't be taken as the average space in memory used by a q-digest, as it can change significantly depending on the variance of the distribution used as input.
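The sizing optimization is essentially lazy allocation with geometric growth. The sketch below illustrates the idea (not the actual patch; the initial capacity, growth factor, and byte accounting are made-up illustrations): arrays start tiny so an empty per-group digest costs almost nothing, and they grow only as centroids are added.

```java
import java.util.Arrays;

// Illustration of the dynamic sizing idea described above (not the actual
// Presto patch): arrays start at a tiny capacity so an empty per-group digest
// costs almost nothing, and grow geometrically only as centroids are added.
public class DynamicCentroidStorage
{
    private static final int INITIAL_CAPACITY = 1; // made-up starting size

    private double[] means = new double[INITIAL_CAPACITY];
    private double[] weights = new double[INITIAL_CAPACITY];
    private int size;

    public void addCentroid(double mean, double weight)
    {
        if (size == means.length) {
            // Double the capacity so the amortized cost of growth stays O(1)
            // per added centroid.
            int newCapacity = means.length * 2;
            means = Arrays.copyOf(means, newCapacity);
            weights = Arrays.copyOf(weights, newCapacity);
        }
        means[size] = mean;
        weights[size] = weight;
        size++;
    }

    public long retainedBytes()
    {
        // Rough estimate: 8 bytes per double slot in the two backing arrays,
        // ignoring object headers; enough to show empty vs. populated digests.
        return 8L * (means.length + weights.length);
    }

    public static void main(String[] args)
    {
        DynamicCentroidStorage empty = new DynamicCentroidStorage();
        DynamicCentroidStorage populated = new DynamicCentroidStorage();
        for (int i = 0; i < 100; i++) {
            populated.addCentroid(i, 1);
        }
        System.out.println("empty:     ~" + empty.retainedBytes() + " bytes of array storage");
        System.out.println("populated: ~" + populated.retainedBytes() + " bytes of array storage");
    }
}
```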
Accuracy
While runtime and memory usage can be good benchmarks, we must test accuracy to ensure that t-digest is really worth implementing over q-digest. To do this, I used samples from the same distributions as above to test the accuracy of t-digest when retrieving quantiles. I used varying compression factors for t-digest to determine which alternative offers the best space-accuracy benefits. In addition, I tested the effects of merging multiple distributions together, which appears to have no significant impact on accuracy. After running several queries using TPC-H data, I realized a few things about the accuracy of t-digest. First off, the error has never been over 1%. That said, there is no theoretical guarantee that it will never go over 1%, but so far it hasn't. Another important observation is that t-digest is not deterministic: two t-digests which have the same inputs added in the same order might produce different results. This made it harder to test accuracy, since I could not rely on the results from a single t-digest to compare with q-digest. At first I ran the same query 10 times and averaged the results, but this didn't seem like the best approach, since it "hid" a few t-digest results which were farther away from the expected results compared to q-digest. Therefore, I ended up running the same query 10 times and using the worst result to compare the two digests. Nevertheless, the results were very promising, as seen in the graph below.
Here, we can clearly see how t-digest performed better than q-digest for most quantiles near the tails (< 5%). In addition, it did so 3x faster while consuming almost 30x less space in memory. However, as opposed to q-digest, there isn't a clear trade-off between memory space and accuracy: if we increase the compression value of the t-digest by a factor of 10, that doesn't mean we will get 10x more accurate results. Therefore, if we need to guarantee a very small maximum error (< 0.1%), then we should still use q-digest and absorb the memory and speed costs. However, for all other queries, it makes more sense to use t-digest since it's proven to be faster, lighter, and usually more accurate.
Conclusions
Overall, I believe t-digest was definitely worth implementing for Presto. It performs better than q-digest for most data points across all benchmarks. In terms of runtime, every operation performed on t-digest takes significantly less time compared to q-digest (at least 3x faster for each metric). In addition, when distributions are sparse, t-digest offers massive storage savings, sometimes using up to 100x fewer bytes to store the data. Finally, in terms of accuracy, t-digest once again outperforms q-digest, especially when retrieving quantiles at the tails. Therefore, I think t-digest will be an excellent addition to Presto, with promising improvements for approx_percentile and other functions that currently use q-digest.