Approx_percentile() implementation gives wrong results with accuracy specified as param

The Presto documentation says to use approx_percentile() when we want more accurate percentile results. According to the documentation, it has this syntax:

[image: documented syntax of approx_percentile()]

By default the accuracy is 0.01, but a smaller value can be passed in the call to get more accurate results.

But when I pass accuracy as a parameter in my query, the results are not correct. I ran the query below:

WITH temp AS (
  SELECT 1 AS num
  UNION
  SELECT 5 AS num
  UNION
  SELECT 10 AS num
  UNION
  SELECT 100 AS num
  UNION
  SELECT 200 AS num
  UNION
  SELECT 500 AS num
  UNION
  SELECT 1000 AS num
  UNION
  SELECT 10000 AS num
  UNION
  SELECT 20000 AS num
)
SELECT
  APPROX_PERCENTILE(num, 0.5),
  APPROX_PERCENTILE(num, 0.5, 0.01),
  APPROX_PERCENTILE(num, 0.5, 0.5),
  APPROX_PERCENTILE(num, 0.5, 0.5, 0.001)
FROM temp

In the query above, the first two aggregations should return the same result, since the default accuracy is 0.01 according to the documentation.

But the results I get are completely off:

[image: query results]

As we can see, the first function gives the correct median value while the second doesn’t.

My intuition is that, instead of calling approx_percentile() with an accuracy, Presto is calling the approx_percentile() overload that takes a weight.

That is, even though it should call this:

[image: approx_percentile() signature with accuracy]

I feel it's somehow calling this:

[image: approx_percentile() signature with weight]
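
To make the suspicion concrete, here is a minimal Python sketch of how the same three arguments behave under each reading. It is not Presto's implementation (Presto uses an approximate digest); it computes an exact weighted percentile purely for illustration, and the names `NUMS` and `weighted_percentile` are hypothetical.

```python
# Values taken from the query in this issue.
NUMS = [1, 5, 10, 100, 200, 500, 1000, 10000, 20000]


def weighted_percentile(values, weights, percentage):
    """Exact weighted percentile (illustrative, not Presto's algorithm).

    Returns the smallest value whose cumulative weight reaches
    `percentage` of the total weight.
    """
    pairs = sorted(zip(values, weights))
    threshold = percentage * sum(w for _, w in pairs)
    running = 0.0
    for value, weight in pairs:
        running += weight
        if running >= threshold:
            return value
    return pairs[-1][0]


# Reading 1: approx_percentile(num, 0.5, 0.01) as (x, percentage, accuracy).
# Accuracy only bounds the error, so the target is still the median.
as_accuracy = weighted_percentile(NUMS, [1.0] * len(NUMS), 0.5)

# Reading 2: the same call as (x, w, percentage), i.e. a weight of 0.5 on
# every row and the 1st percentile requested.
as_weight = weighted_percentile(NUMS, [0.5] * len(NUMS), 0.01)

print(as_accuracy, as_weight)
```

Under these assumptions the first reading yields 200, the exact median of the nine values, while the second yields 1, the smallest value. A result near the minimum for the second aggregation would therefore be consistent with the weight overload being selected.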

Another suspected cause is the way SQL calls the underlying Java functions. I found that it expects the parameter types shown in the image below in its function signatures.

That is, nowhere is a function defined for approx_percentile(bigint, bigint, double), which is what the weight case should have been.

[image: function signatures for approx_percentile()]

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

kunalkohli commented, Jul 2, 2020 (1 reaction)

Thanks guys for clearing this confusion. Closing this issue for now.

Also, the current prestodb implementation expects the weight to be an INTEGER when we call approx_percentile(x, w, percentage), but the prestosql implementation expects the weight to be a double.

mbasmanova commented, Jul 2, 2020 (1 reaction)

@kunalkohli prestosql.io is a fork of this project, prestodb.io. If you are using PrestoDB you should use documentation from prestodb.io. If you are using PrestoSQL you should use the other documentation.
