Hive table scan slow to schedule
We recently upgraded from 0.198 to 0.234.2 and are finding that a seemingly simple Hive query runs roughly 10x slower.
Query is like:
select count(*) from hive.schema.table where date between '2020-05-01' and '2020-06-01';
The table is partitioned by date, with one partition per day, so the query covers around 30 partitions.
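As a sanity check on the partition count: the BETWEEN predicate is inclusive on both ends, so it actually covers 32 daily partitions (all of May plus June 1), matching the partition list in the plans below. Quick arithmetic in plain Python:

```python
from datetime import date

# Inclusive date range matching the BETWEEN predicate in the query
start, end = date(2020, 5, 1), date(2020, 6, 1)

num_partitions = (end - start).days + 1  # BETWEEN is inclusive on both ends
print(num_partitions)  # 32
```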
I ran this on a 3-node cluster with Presto 0.234.2 and on a 3-node cluster with Presto 0.198, using the same instance types.
Query stats for Presto 0.198 are:
Query 20200708_192220_00049_saeib, FINISHED, 3 nodes
Splits: 1,507 total, 1,507 done (100.00%)
0:04 [6.18B rows, 174MB] [1.5B rows/s, 42.3MB/s]
Query stats for Presto 0.234.2 are:
Query 20200708_192252_00054_tqsbg, FINISHED, 3 nodes
Splits: 1,510 total, 1,510 done (100.00%)
1:16 [6.18B rows, 214MB] [81.1M rows/s, 2.81MB/s]
As you can see, the speeds are significantly different. Any thoughts on how to debug?
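The slowdown can be quantified directly from the two summaries above (6.18B rows scanned in both runs; wall-clock times 0:04 vs 1:16):

```python
rows = 6_180_000_000           # 6.18B rows scanned in both runs
t_old, t_new = 4, 76           # wall-clock seconds: 0:04 vs 1:16

throughput_old = rows / t_old  # ~1.5B rows/s, matching the 0.198 summary
throughput_new = rows / t_new  # ~81M rows/s, matching the 0.234.2 summary
print(f"{throughput_old / throughput_new:.0f}x slower")  # 19x slower
```

So by wall-clock time the regression is closer to 19x than 10x, since the same data volume is scanned in both runs.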
Some more stats: I also ran EXPLAIN ANALYZE on each cluster. 0.198:
"Fragment 1 [SINGLE]
CPU: 728.77ms, Input: 1488 rows (13.08kB); per task: avg.: 1488.00 std.dev.: 0.00, Output: 1 row (9B)
Output layout: [count]
Output partitioning: SINGLE []
Execution Flow: UNGROUPED_EXECUTION
- Aggregate(FINAL) => [count:bigint]
CPU fraction: 6.90%, Output: 1 row (9B)
Input avg.: 1488.00 rows, Input std.dev.: 0.00%
count := ""count""(""count_3"")
- LocalExchange[SINGLE] () => count_3:bigint
CPU fraction: 58.62%, Output: 1488 rows (13.08kB)
Input avg.: 93.00 rows, Input std.dev.: 191.74%
- RemoteSource[2] => [count_3:bigint]
CPU fraction: 34.48%, Output: 1488 rows (13.08kB)
Input avg.: 93.00 rows, Input std.dev.: 191.74%
Fragment 2 [SOURCE]
CPU: 2.82m, Input: 6175868099 rows (0B); per task: avg.: 2058622699.67 std.dev.: 28273112.95, Output: 1488 rows (13.09kB)
Output layout: [count_3]
Output partitioning: SINGLE []
Execution Flow: UNGROUPED_EXECUTION
- Aggregate(PARTIAL) => [count_3:bigint]
Cost: {rows: ? (?), cpu: ?, memory: ?, network: 0.00}
CPU fraction: 5.22%, Output: 1488 rows (13.09kB)
Input avg.: 4150448.99 rows, Input std.dev.: 45.67%
count_3 := ""count""(*)
- TableScan[hive:bi:unprod_bids_metrics, originalConstraint = (""date"" BETWEEN CAST('2020-05-01' AS varchar) AND CAST('2020-06-01' AS varchar))] => []
Cost: {rows: ? (?), cpu: ?, memory: 0.00, network: 0.00}
CPU fraction: 94.78%, Output: 6175868099 rows (0B)
Input avg.: 4150448.99 rows, Input std.dev.: 45.67%
LAYOUT: bi.unprod_bids_metrics
HiveColumnHandle{name=date, hiveType=string, hiveColumnIndex=-1, columnType=PARTITION_KEY}
:: [[2020-05-01], [2020-05-02], [2020-05-03], [2020-05-04], [2020-05-05], [2020-05-06], [2020-05-07], [2020-05-08], [2020-05-09], [2020-05-10], [2020-05-11], [2020-05-12], [2020-05-13], [2020-05-14], [2020-05-15], [2020-05-16], [2020-05-17], [2020-05-18], [2020-05-19], [2020-05-20], [2020-05-21], [2020-05-22], [2020-05-23], [2020-05-24], [2020-05-25], [2020-05-26], [2020-05-27], [2020-05-28], [2020-05-29], [2020-05-30], [2020-05-31], [2020-06-01]]
"
0.234.2:
"Fragment 1 [SINGLE]
CPU: 705.57ms, Scheduled: 16.81s, Input: 1490 rows (13.10kB); per task: avg.: 1490.00 std.dev.: 0.00, Output: 1 row (9B)
Output layout: [count]
Output partitioning: SINGLE []
Stage Execution Strategy: UNGROUPED_EXECUTION
- Aggregate(FINAL) => [count:bigint]
CPU: 83.00ms (0.02%), Scheduled: 2.50s (0.02%), Output: 1 row (9B)
Input avg.: 1490.00 rows, Input std.dev.: 0.00%
count := ""presto.default.count""((count_3))
- LocalExchange[SINGLE] () => [count_3:bigint]
CPU: 89.00ms (0.02%), Scheduled: 2.20s (0.02%), Output: 1490 rows (13.10kB)
Input avg.: 93.13 rows, Input std.dev.: 222.07%
- RemoteSource[2] => [count_3:bigint]
CPU: 62.00ms (0.01%), Scheduled: 1.03s (0.01%), Output: 1490 rows (13.10kB)
Input avg.: 93.13 rows, Input std.dev.: 222.07%
Fragment 2 [SOURCE]
CPU: 7.78m, Scheduled: 3.97h, Input: 6175868099 rows (0B); per task: avg.: 2058622699.67 std.dev.: 98967421.99, Output: 1490 rows (13.09kB)
Output layout: [count_3]
Output partitioning: SINGLE []
Stage Execution Strategy: UNGROUPED_EXECUTION
- Aggregate(PARTIAL) => [count_3:bigint]
CPU: 4.55m (58.54%), Scheduled: 2.53h (63.24%), Output: 1490 rows (13.09kB)
Input avg.: 4144877.92 rows, Input std.dev.: 45.31%
count_3 := ""presto.default.count""(*)
- TableScan[TableHandle {connectorId='hive', connectorHandle='HiveTableHandle{schemaName=bi, tableName=unprod_bids_metrics, analyzePartitionValues=Optional.empty}', layout='Optional[bi.unprod_bids_metrics{domains={date=[ [[2020-05-01, 2020-06-01]] ]}}]'}, grouped = false] => []
CPU: 3.22m (41.41%), Scheduled: 1.47h (36.72%), Output: 6175868099 rows (0B)
Input avg.: 4144877.92 rows, Input std.dev.: 45.31%
LAYOUT: bi.unprod_bids_metrics{domains={date=[ [[2020-05-01, 2020-06-01]] ]}}
date:string:-13:PARTITION_KEY
:: [[2020-05-01], [2020-05-02], [2020-05-03], [2020-05-04], [2020-05-05], [2020-05-06], [2020-05-07], [2020-05-08], [2020-05-09], [2020-05-10], [2020-05-11], [2020-05-12], [2020-05-13], [2020-05-14], [2020-05-15], [2020-05-16], [2020-05-17], [2020-05-18], [2020-05-19], [2020-05-20], [2020-05-21], [2020-05-22], [2020-05-23], [2020-05-24], [2020-05-25], [2020-05-26], [2020-05-27], [2020-05-28], [2020-05-29], [2020-05-30], [2020-05-31], [2020-06-01]]
Input: 6175868099 rows (0B), Filtered: 0.00%
"
A screenshot of the query showed the long-running scheduling phase of the table scan. Once scheduling completes, the query finishes quickly.
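The scheduling bottleneck is also visible in the 0.234.2 plan itself: in Fragment 2 the scheduled time (3.97h) dwarfs the CPU time (7.78m), so the splits are mostly waiting rather than computing. Rough arithmetic on those two figures:

```python
cpu_min = 7.78             # Fragment 2 CPU: 7.78m (0.234.2 plan)
scheduled_min = 3.97 * 60  # Fragment 2 Scheduled: 3.97h -> 238.2 minutes

# Scheduled time is roughly 30x the CPU time, i.e. almost all of the
# fragment's wall-clock budget is spent outside actual computation.
print(f"{scheduled_min / cpu_min:.0f}x")  # 31x
```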
Issue Analytics
- Created 3 years ago
- Comments: 8 (6 by maintainers)
Top GitHub Comments
@benrifkind Ben, I’m glad we were able to help you resolve this issue.
🤦 Upgrading to 0.234.3 fixed this issue for me. Wish I had tried this before posting an issue.
@mbasmanova Thank you very much for your help!