question-mark
Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

GROUP BY column list order affects runtime

See original GitHub issue

Qry1: –takes 1min (UI json shows “cumulativeUserMemory” : 3.3699831825E11, “totalScheduledTime” : “19.26m”,“peakTaskTotalMemory” : “57267780B”, “info” : { “@type” : “exchangeClientStatus”, “bufferedBytes” : 0, “maxBufferedBytes” : 485348, “averageBytesPerRequest” : 12007, “successfulRequestsCount” : 1264, “bufferedPages” : 0, “noMoreLocations” : true, “pageBufferClientStatuses” : [ ] }

SELECT a,b,c,d,e,
sum(value_double) as value FROM s.t
WHERE d BETWEEN TIMESTAMP '2020-03-01' AND TIMESTAMP '2020-08-11'  
AND data_set_name IN ('e,'t')
GROUP BY d,b,a,e,c

Qry2: –takes 2mins (UI json shows “cumulativeUserMemory” : 1.058281153416E12, “totalScheduledTime” : “29.80m”,“peakTaskTotalMemory” : “105336700B”, “info” : { “@type” : “exchangeClientStatus”, “bufferedBytes” : 0, “maxBufferedBytes” : 485211, “averageBytesPerRequest” : 5506, “successfulRequestsCount” : 2752, “bufferedPages” : 0, “noMoreLocations” : true, “pageBufferClientStatuses” : [ ] }

SELECT a,b,c,d,e,
sum(value_double) as value FROM s.t
WHERE d BETWEEN TIMESTAMP '2020-03-01' AND TIMESTAMP '2020-08-11'  
AND data_set_name IN ('e,'t')
GROUP BY a,b,c,d,e

from slack: " Tao Wang Apr 3rd at 3:58 AM It’s possible to improve the performance of the GROUP BY function by carefully ordering a list of fields within the GROUP BY in an order of high cardinality I am a little confused about this tip. Based on my understanding, since the group by is implemented by hash in Presto, there should not be difference between different group-by order. Could someone shed light on this? Thanks! 15 replies

findepi 6 months ago Where this quote is from?

raunaqm 6 months ago I think this is mentioned by Athena and Treasure data in tuning guide. https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/ https://support.treasuredata.com/hc/en-us/articles/360001450908-Presto-Performance-Tuning Amazon Web ServicesAmazon Web Services Top 10 Performance Tuning Tips for Amazon Athena | Amazon Web Services This blog post has been translated into Japanese. Amazon Athena is an interactive query service that makes it easy to analyze data stored in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run. Athena is easy to use. Simply […] Mar 24th, 2017 (31 kB) https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2017/11/16/Top10.png Arm Treasure DataArm Treasure Data Presto Performance Tuning We have a new docs home, for this page visit our new documentation site! Prerequisites Basic knowledge of Arm Treasure Data. Basic knowledge of Presto query engine. … (7 kB) https://theme.zdassets.com/theme_assets/766587/78b016adfce27707ee53e066d510017a2788bdc9.png

findepi 6 months ago I don’t know why the wrote so. Maybe that was based on some old version? @Tao Wang in general — this certainly doesn’t matter, as you pointed out, everything is hash-based in a less general case, when the data is already partitioned by a subset of columns, we do not repartition it. but that doesn’t depend on the order of columns in GROUP BY either.

Tao Wang 6 months ago Thanks @findepi and @raunaqm! I did see this quote from arm treasure data . The reason i have this question is i have large performance difference with different order. Below are the same queries with different group by order. (edited)

Tao Wang 6 months ago – take about 10+ minutes select player_id_type, game_id_type, platform, release_type, source_type, country, mode_type, game_mode, game_type, game_difficulty, game_map, is_spender, is_new_player, subscription_type, event_source, cast(udf_enhanced_approx_set(player_id, cast(0.0115 as double)) as varbinary) as active_users_sketch, sum(event_count) as event_count, sum(session_duration_sum) as session_duration_sum, dt, pgroup, game_id from ${db}.${table} where dt=‘2020-03-16’ group by 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,19,20,21;

Tao Wang 6 months ago – take less than 3 minutes select player_id_type, game_id_type, platform, release_type, source_type, country, mode_type, game_mode, game_type, game_difficulty, game_map, is_spender, is_new_player, subscription_type, event_source, cast(udf_enhanced_approx_set(player_id, cast(0.0115 as double)) as varbinary) as active_users_sketch, sum(event_count) as event_count, sum(session_duration_sum) as session_duration_sum, dt, pgroup, game_id from ${db}.${table} where dt=‘2020-03-16’ group by 13,1,2,3,4,5,6,7,8,9,10,11,12,14,15,19,20,21;

findepi 6 months ago can you compare EXPLAIN plans for different orders and check what is the distribution (what columns are hashed) of indivudal fragments?

findepi 6 months ago since there is only TS + GROUP BY, there shouldn’t be any edge cases. but, the difference can be attributed to equals invocations when finding values within bucket in memory (let me checck) (edited)

Tao Wang 6 months ago Is it possible related to the hash collision?

findepi 6 months ago https://github.com/prestosql/presto/blob/fd5f4988df52dd49be6e95b681c0b87481ef4e4f/presto-main/src/main/java/io/prestosql/operator/MultiChannelGroupByHash.java#L286 presto-main/src/main/java/io/prestosql/operator/MultiChannelGroupByHash.java:286 if (positionNotDistinctFromCurrentRow(groupAddressByHash[hashPosition], hashPosition, position, page, (byte) rawHash, channels)) { https://github.com/prestosql/presto|prestosql/prestoprestosql/presto | Added by GitHub

findepi 6 months ago it’s not only about hash collisions per se (they are rare, if values are inequal), but it’s hash % size , to effective collisions are more frequent that should be filtered out by the rawHashByHashPosition[hashPosition] != rawHash in that method… (edited)

findepi 6 months ago I think there is something i am not seeing. @dain do you recall?

Tao Wang 6 months ago Thanks @findepi! The EXPLAIN plans are pretty the same except the different group-by column. Looking forward to dain’s reply. Thanks all.

findepi 6 months ago different group-by column just in different order, right?

Tao Wang 6 months ago yes, you are right. just in different order "

Issue Analytics

  • State:open
  • Created 3 years ago
  • Comments:6 (6 by maintainers)

github_iconTop GitHub Comments

1reaction
ssquancommented, Oct 17, 2020

Maybe because of the combine_hash function when calculating the hash of columns in the GROUP BY clause. According to the source code, the hash value may overflow after the left value multiple 31.

    @ScalarFunction(value = "combine_hash", hidden = true)
    @SqlType(StandardTypes.BIGINT)
    public static long getHash(@SqlType(StandardTypes.BIGINT) long previousHashValue, @SqlType(StandardTypes.BIGINT) long value)
    {
        return (31 * previousHashValue + value);
    }
0reactions
tooptoop4commented, Nov 6, 2020

consistently different number of collisions based on order of the columns

Read more comments on GitHub >

github_iconTop Results From Across the Web

Does the order of columns matter in a group by clause?
No, the order doesn't matter for the GROUP BY clause. MySQL and SQLite are the only databases I'm aware of that allow you...
Read more >
SQL Server ORDER BY performance tips
The duration and resource (CPU and memory) consumption of the sort operation will change depending on the number of rows to be sorted....
Read more >
SQL GROUP BY to Summarize and Rollup Query Results
Since the column order affects the ROLLUP output, it can also affect the number of rows returned in the result set. The GROUP...
Read more >
In MySQL, does the order of the columns in a WHERE clause ...
I do not think so. The query optimizer should be clever enough. You can try rearranging the WHERE clauses and see that EXPLAINS...
Read more >
GROUP BY Clause | ClickHouse Docs
There's an additional way to run aggregation over a table. If a query contains table columns only inside aggregate functions, the GROUP BY...
Read more >

github_iconTop Related Medium Post

No results found

github_iconTop Related StackOverflow Question

No results found

github_iconTroubleshoot Live Code

Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start Free

github_iconTop Related Reddit Thread

No results found

github_iconTop Related Hackernoon Post

No results found

github_iconTop Related Tweet

No results found

github_iconTop Related Dev.to Post

No results found

github_iconTop Related Hashnode Post

No results found