Extra remote exchange when window partition columns are a superset of bucket columns
presto:tiny> CREATE TABLE test_bucket (
-> bucket_key INT,
-> foo VARCHAR,
-> bar VARCHAR,
-> partition_key VARCHAR)
-> WITH (
-> partitioned_by=ARRAY['partition_key'],
-> bucketed_by=ARRAY['bucket_key'],
-> bucket_count=16);
CREATE TABLE
presto:tiny> INSERT INTO test_bucket VALUES (1, 'foo', 'bar', '2018-12-18');
INSERT: 1 row
Query 20181219_202234_58401_d46p5, FINISHED, 18 nodes
Splits: 22 total, 22 done (100.00%)
0:06 [0 rows, 0B] [0 rows/s, 0B/s]
presto:tiny> EXPLAIN (TYPE DISTRIBUTED)
-> SELECT rank() OVER (PARTITION BY bucket_key)
-> FROM test_bucket;
Query Plan
-----------------------------------------------------------------------------------------------------------------
Fragment 0 [SINGLE]
Output layout: [rank]
Output partitioning: SINGLE []
Grouped Execution: false
- Output[_col0] => [rank:bigint]
_col0 := rank
- RemoteSource[1] => [rank:bigint]
Fragment 1 [HASH]
Output layout: [rank]
Output partitioning: SINGLE []
Grouped Execution: false
- Project[] => [rank:bigint]
- Window[partition by (bucket_key)][$hashvalue] => [bucket_key:integer, $hashvalue:bigint, rank:bigint]
rank := rank() RANGE UNBOUNDED_PRECEDING CURRENT_ROW
- LocalExchange[HASH][$hashvalue] ("bucket_key") => [bucket_key:integer, $hashvalue:bigint]
- RemoteSource[2] => [bucket_key:integer, $hashvalue_4:bigint]
Fragment 2 [prism:buckets=16, hiveTypes=[int]]
Output layout: [bucket_key, $hashvalue_5]
Output partitioning: HASH [bucket_key][$hashvalue_5]
Grouped Execution: false
- ScanProject[table = prism:di:test_bucket, grouped = false] => [bucket_key:integer, $hashvalue_5:bigint]
$hashvalue_5 := "combine_hash"(bigint '0', COALESCE("$operator$hash_code"("bucket_key"), 0))
LAYOUT: di.test_bucket bucket=16
bucket_key := bucket_key:int:0:REGULAR
partition_key:string:-1:PARTITION_KEY
:: [[2018-12-18]]
(1 row)
Query 20181219_202654_59652_d46p5, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]
It looks to me like, if the window partition columns (e.g. ['a', 'b', 'c']) are a superset of the bucket columns (e.g. ['a', 'b']), we don't need to do a remote exchange, since the required rows are already available locally. That way, fragments 1 and 2 could be combined into a single fragment.
Let me know if I'm missing some obvious/essential part. If it looks good, I could work on this optimization.
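As a sketch of the proposed eligibility test (the helper name and Set-based signature below are illustrative, not Presto's actual API): the remote exchange is redundant exactly when every bucket column appears among the window's PARTITION BY columns, since two rows of the same window partition then agree on all bucket columns, hash to the same bucket, and are already co-located on one node.

import java.util.Set;

public class SupersetCheckSketch
{
    // Hypothetical helper, not Presto's API: true when the window PARTITION BY
    // columns are a superset of the table's bucket columns, i.e. all rows of a
    // window partition live in the same bucket, so a local exchange is enough
    // to feed the window operator.
    static boolean remoteExchangeRedundant(Set<String> windowPartitionColumns, Set<String> bucketColumns)
    {
        return !bucketColumns.isEmpty() && windowPartitionColumns.containsAll(bucketColumns);
    }

    public static void main(String[] args)
    {
        // PARTITION BY (a, b, c) over a table bucketed on (a, b): exchange redundant
        System.out.println(remoteExchangeRedundant(Set.of("a", "b", "c"), Set.of("a", "b"))); // true
        // PARTITION BY (a, c) over the same table: rows sharing (a, c) may sit
        // in different buckets, so the remote exchange is still required
        System.out.println(remoteExchangeRedundant(Set.of("a", "c"), Set.of("a", "b"))); // false
    }
}

Note that, per the discussion in the comments below, this check alone is not sufficient for a partitioned table: rows with the same bucket_key may still come from different Hive partitions.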
Top GitHub Comments
@findepi: It is the same for an unpartitioned table.
@sopel39: I've checked that hive.bucket_execution_enabled and plan_with_table_node_partitioning are both set to true. For GROUP BY, there won't be an additional remote exchange between the aggregation and the table scan. There is no ConnectorTableLayout#getNodePartitioning, so I'm a bit confused. ActualProperties#getNodePartitioning is not empty, and ConnectorTableLayout#getTablePartitioning is not empty either. Hope I'm looking at the right method.
Looks like the remote exchange is added here: https://github.com/prestodb/presto/blob/7dfbe1be3c61455047827335c5dadd94125e0c48/presto-main/src/main/java/com/facebook/presto/sql/planner/optimizations/AddExchanges.java#L309-L311, after we check isStreamPartitionedOn. Yet in visitAggregation() (https://github.com/prestodb/presto/blob/7dfbe1be3c61455047827335c5dadd94125e0c48/presto-main/src/main/java/com/facebook/presto/sql/planner/optimizations/AddExchanges.java#L232) I do see that we also check isNodePartitionedOn() in addition to isStreamPartitionedOn(). For window functions, if the PARTITION BY columns include the columns the node is partitioned on, I think a local exchange should suffice?
In your example: I think you need to repartition after the table scan to combine rows with the same bucket_key but from different partitions (unless we realize there is exactly one partition). @shixuan-fan, does the same happen when the table is not partitioned?
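To make the asymmetry concrete, here is a standalone paraphrase of the planner behavior discussed above. The isStreamPartitionedOn/isNodePartitionedOn names mirror the methods cited in the comments, but the surrounding types are simplified stand-ins, not Presto's actual AddExchanges code:

import java.util.List;

public class ExchangeDecisionSketch
{
    // Simplified stand-in for the plan properties consulted by AddExchanges;
    // only the two checks named in the comments above are modeled.
    interface StreamProperties
    {
        boolean isStreamPartitionedOn(List<String> columns);

        boolean isNodePartitionedOn(List<String> columns);
    }

    // Window case at the linked lines: a remote exchange is added whenever the
    // child is not stream-partitioned on the PARTITION BY columns, even if the
    // data is already node-partitioned (e.g. by table bucketing).
    static boolean windowNeedsRemoteExchange(StreamProperties child, List<String> partitionBy)
    {
        return !child.isStreamPartitionedOn(partitionBy);
    }

    // visitAggregation-style behavior: node partitioning also counts, so a
    // bucketed table scan partitioned on the GROUP BY columns avoids the
    // remote exchange; a local exchange can still line rows up per driver.
    static boolean aggregationNeedsRemoteExchange(StreamProperties child, List<String> groupBy)
    {
        return !child.isStreamPartitionedOn(groupBy) && !child.isNodePartitionedOn(groupBy);
    }

    public static void main(String[] args)
    {
        // Child plan that is node-partitioned (bucketed) on bucket_key but not
        // stream-partitioned on it, matching Fragment 2 in the plan above.
        StreamProperties bucketedScan = new StreamProperties()
        {
            @Override
            public boolean isStreamPartitionedOn(List<String> columns)
            {
                return false; // a plain table scan is not stream-partitioned
            }

            @Override
            public boolean isNodePartitionedOn(List<String> columns)
            {
                return columns.equals(List.of("bucket_key")); // bucketed on bucket_key
            }
        };
        List<String> keys = List.of("bucket_key");
        System.out.println(windowNeedsRemoteExchange(bucketedScan, keys));      // true: extra exchange today
        System.out.println(aggregationNeedsRemoteExchange(bucketedScan, keys)); // false: GROUP BY avoids it
    }
}

Under this reading, extending the window case to also consult isNodePartitionedOn (plus handling the multiple-Hive-partitions caveat raised above) would allow fragments 1 and 2 of the plan to be merged.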