Hive table scan slow to schedule
We recently upgraded from 0.198 to 0.234.2 and are finding that a seemingly simple Hive query runs roughly 10x slower.
Query is like:
select count(*) from hive.schema.table where date between '2020-05-01' and '2020-06-01';
The table is partitioned by date, with one partition per day, so the query covers around 30 partitions.
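As a sanity check on the partition count: the BETWEEN predicate is inclusive on both ends, so it actually covers 32 daily partitions (all of May plus June 1), matching the partition list in the plans below. Quick arithmetic in plain Python:

```python
from datetime import date

# Inclusive date range matching the BETWEEN predicate in the query
start, end = date(2020, 5, 1), date(2020, 6, 1)

num_partitions = (end - start).days + 1  # BETWEEN is inclusive on both ends
print(num_partitions)  # 32
```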
I ran this on a 3-node cluster with Presto 0.234.2 and on a 3-node cluster with Presto 0.198, using the same instance types.
Query stats for Presto 0.198 are:
Query 20200708_192220_00049_saeib, FINISHED, 3 nodes
Splits: 1,507 total, 1,507 done (100.00%)
0:04 [6.18B rows, 174MB] [1.5B rows/s, 42.3MB/s]
Query stats for Presto 0.234.2 are:
Query 20200708_192252_00054_tqsbg, FINISHED, 3 nodes
Splits: 1,510 total, 1,510 done (100.00%)
1:16 [6.18B rows, 214MB] [81.1M rows/s, 2.81MB/s]
As you can see, the speeds are significantly different. Any thoughts on how to debug?
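The slowdown can be quantified directly from the two summaries above (6.18B rows scanned in both runs; wall-clock times 0:04 vs 1:16):

```python
rows = 6_180_000_000           # 6.18B rows scanned in both runs
t_old, t_new = 4, 76           # wall-clock seconds: 0:04 vs 1:16

throughput_old = rows / t_old  # ~1.5B rows/s, matching the 0.198 summary
throughput_new = rows / t_new  # ~81M rows/s, matching the 0.234.2 summary
print(f"{throughput_old / throughput_new:.0f}x slower")  # 19x slower
```

So by wall-clock time the regression is closer to 19x than 10x, since the same data volume is scanned in both runs.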
Some more stats: I also ran EXPLAIN ANALYZE on each cluster. 0.198:
"Fragment 1 [SINGLE]
CPU: 728.77ms, Input: 1488 rows (13.08kB); per task: avg.: 1488.00 std.dev.: 0.00, Output: 1 row (9B)
Output layout: [count]
Output partitioning: SINGLE []
Execution Flow: UNGROUPED_EXECUTION
- Aggregate(FINAL) => [count:bigint]
CPU fraction: 6.90%, Output: 1 row (9B)
Input avg.: 1488.00 rows, Input std.dev.: 0.00%
count := ""count""(""count_3"")
- LocalExchange[SINGLE] () => count_3:bigint
CPU fraction: 58.62%, Output: 1488 rows (13.08kB)
Input avg.: 93.00 rows, Input std.dev.: 191.74%
- RemoteSource[2] => [count_3:bigint]
CPU fraction: 34.48%, Output: 1488 rows (13.08kB)
Input avg.: 93.00 rows, Input std.dev.: 191.74%
Fragment 2 [SOURCE]
CPU: 2.82m, Input: 6175868099 rows (0B); per task: avg.: 2058622699.67 std.dev.: 28273112.95, Output: 1488 rows (13.09kB)
Output layout: [count_3]
Output partitioning: SINGLE []
Execution Flow: UNGROUPED_EXECUTION
- Aggregate(PARTIAL) => [count_3:bigint]
Cost: {rows: ? (?), cpu: ?, memory: ?, network: 0.00}
CPU fraction: 5.22%, Output: 1488 rows (13.09kB)
Input avg.: 4150448.99 rows, Input std.dev.: 45.67%
count_3 := ""count""(*)
- TableScan[hive:bi:unprod_bids_metrics, originalConstraint = (""date"" BETWEEN CAST('2020-05-01' AS varchar) AND CAST('2020-06-01' AS varchar))] => []
Cost: {rows: ? (?), cpu: ?, memory: 0.00, network: 0.00}
CPU fraction: 94.78%, Output: 6175868099 rows (0B)
Input avg.: 4150448.99 rows, Input std.dev.: 45.67%
LAYOUT: bi.unprod_bids_metrics
HiveColumnHandle{name=date, hiveType=string, hiveColumnIndex=-1, columnType=PARTITION_KEY}
:: [[2020-05-01], [2020-05-02], [2020-05-03], [2020-05-04], [2020-05-05], [2020-05-06], [2020-05-07], [2020-05-08], [2020-05-09], [2020-05-10], [2020-05-11], [2020-05-12], [2020-05-13], [2020-05-14], [2020-05-15], [2020-05-16], [2020-05-17], [2020-05-18], [2020-05-19], [2020-05-20], [2020-05-21], [2020-05-22], [2020-05-23], [2020-05-24], [2020-05-25], [2020-05-26], [2020-05-27], [2020-05-28], [2020-05-29], [2020-05-30], [2020-05-31], [2020-06-01]]
"
0.234.2:
"Fragment 1 [SINGLE]
CPU: 705.57ms, Scheduled: 16.81s, Input: 1490 rows (13.10kB); per task: avg.: 1490.00 std.dev.: 0.00, Output: 1 row (9B)
Output layout: [count]
Output partitioning: SINGLE []
Stage Execution Strategy: UNGROUPED_EXECUTION
- Aggregate(FINAL) => [count:bigint]
CPU: 83.00ms (0.02%), Scheduled: 2.50s (0.02%), Output: 1 row (9B)
Input avg.: 1490.00 rows, Input std.dev.: 0.00%
count := ""presto.default.count""((count_3))
- LocalExchange[SINGLE] () => [count_3:bigint]
CPU: 89.00ms (0.02%), Scheduled: 2.20s (0.02%), Output: 1490 rows (13.10kB)
Input avg.: 93.13 rows, Input std.dev.: 222.07%
- RemoteSource[2] => [count_3:bigint]
CPU: 62.00ms (0.01%), Scheduled: 1.03s (0.01%), Output: 1490 rows (13.10kB)
Input avg.: 93.13 rows, Input std.dev.: 222.07%
Fragment 2 [SOURCE]
CPU: 7.78m, Scheduled: 3.97h, Input: 6175868099 rows (0B); per task: avg.: 2058622699.67 std.dev.: 98967421.99, Output: 1490 rows (13.09kB)
Output layout: [count_3]
Output partitioning: SINGLE []
Stage Execution Strategy: UNGROUPED_EXECUTION
- Aggregate(PARTIAL) => [count_3:bigint]
CPU: 4.55m (58.54%), Scheduled: 2.53h (63.24%), Output: 1490 rows (13.09kB)
Input avg.: 4144877.92 rows, Input std.dev.: 45.31%
count_3 := ""presto.default.count""(*)
- TableScan[TableHandle {connectorId='hive', connectorHandle='HiveTableHandle{schemaName=bi, tableName=unprod_bids_metrics, analyzePartitionValues=Optional.empty}', layout='Optional[bi.unprod_bids_metrics{domains={date=[ [[2020-05-01, 2020-06-01]] ]}}]'}, grouped = false] => []
CPU: 3.22m (41.41%), Scheduled: 1.47h (36.72%), Output: 6175868099 rows (0B)
Input avg.: 4144877.92 rows, Input std.dev.: 45.31%
LAYOUT: bi.unprod_bids_metrics{domains={date=[ [[2020-05-01, 2020-06-01]] ]}}
date:string:-13:PARTITION_KEY
:: [[2020-05-01], [2020-05-02], [2020-05-03], [2020-05-04], [2020-05-05], [2020-05-06], [2020-05-07], [2020-05-08], [2020-05-09], [2020-05-10], [2020-05-11], [2020-05-12], [2020-05-13], [2020-05-14], [2020-05-15], [2020-05-16], [2020-05-17], [2020-05-18], [2020-05-19], [2020-05-20], [2020-05-21], [2020-05-22], [2020-05-23], [2020-05-24], [2020-05-25], [2020-05-26], [2020-05-27], [2020-05-28], [2020-05-29], [2020-05-30], [2020-05-31], [2020-06-01]]
Input: 6175868099 rows (0B), Filtered: 0.00%
"
A screenshot of the query showed the long-running scheduling phase of the table scan. Once scheduling completes, the query finishes quickly.
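The scheduling bottleneck is also visible in the 0.234.2 plan itself: in Fragment 2 the scheduled time (3.97h) dwarfs the CPU time (7.78m), so the splits are mostly waiting rather than computing. Rough arithmetic on those two figures:

```python
cpu_min = 7.78             # Fragment 2 CPU: 7.78m (0.234.2 plan)
scheduled_min = 3.97 * 60  # Fragment 2 Scheduled: 3.97h -> 238.2 minutes

# Scheduled time is roughly 30x the CPU time, i.e. almost all of the
# fragment's wall-clock budget is spent outside actual computation.
print(f"{scheduled_min / cpu_min:.0f}x")  # 31x
```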
Issue Analytics
- Created 3 years ago
- Comments: 8 (6 by maintainers)
Top GitHub Comments
@benrifkind Ben, I’m glad we were able to help you resolve this issue.
🤦 Upgrading to 0.234.3 fixed this issue for me. Wish I had tried this before posting an issue.
@mbasmanova Thank you very much for your help!