
Hive table scan slow to schedule

See original GitHub issue

I recently upgraded from 0.198 to 0.234.2 and am finding that a seemingly simple Hive query is running 10x slower.

The query looks like:

select count(*) from hive.schema.table where date between '2020-05-01' and '2020-06-01';

The table is partitioned by date, with one partition per day, so this query covers around 30 partitions.
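
For reference, one quick way to confirm that the predicate really prunes down to ~30 partitions is the Hive connector's hidden $partitions table. This is just a sketch, with hive.schema.table standing in for the real table name used above:

-- count the partitions matched by the date predicate (table name is a placeholder)
SELECT count(*)
FROM hive.schema."table$partitions"
WHERE date BETWEEN '2020-05-01' AND '2020-06-01';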

I ran this on a 3-node cluster with Presto 0.234.2 and on a 3-node cluster with Presto 0.198, using the same instance types.

Query stats for Presto 0.198 are:

Query 20200708_192220_00049_saeib, FINISHED, 3 nodes
Splits: 1,507 total, 1,507 done (100.00%)
0:04 [6.18B rows, 174MB] [1.5B rows/s, 42.3MB/s]

Query stats for Presto 0.234.2 are:

Query 20200708_192252_00054_tqsbg, FINISHED, 3 nodes
Splits: 1,510 total, 1,510 done (100.00%)
1:16 [6.18B rows, 214MB] [81.1M rows/s, 2.81MB/s]

As you can see, the speeds are significantly different. Any thoughts on how to debug?
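
One way to narrow down where the extra minute goes is to look at the coordinator's own timing breakdown, for example via the system.runtime.queries table. A rough sketch (the exact column set varies between Presto versions, so selecting everything and picking out the queued/analysis/planning timings is safest):

-- inspect coordinator-side timing for the slow query
SELECT *
FROM system.runtime.queries
WHERE query_id = '20200708_192252_00054_tqsbg';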

Some more stats: I also ran EXPLAIN ANALYZE for the query on each cluster.

0.198:

Fragment 1 [SINGLE]
    CPU: 728.77ms, Input: 1488 rows (13.08kB); per task: avg.: 1488.00 std.dev.: 0.00, Output: 1 row (9B)
    Output layout: [count]
    Output partitioning: SINGLE []
    Execution Flow: UNGROUPED_EXECUTION
    - Aggregate(FINAL) => [count:bigint]
            CPU fraction: 6.90%, Output: 1 row (9B)
            Input avg.: 1488.00 rows, Input std.dev.: 0.00%
            count := "count"("count_3")
        - LocalExchange[SINGLE] () => count_3:bigint
                CPU fraction: 58.62%, Output: 1488 rows (13.08kB)
                Input avg.: 93.00 rows, Input std.dev.: 191.74%
            - RemoteSource[2] => [count_3:bigint]
                    CPU fraction: 34.48%, Output: 1488 rows (13.08kB)
                    Input avg.: 93.00 rows, Input std.dev.: 191.74%

Fragment 2 [SOURCE]
    CPU: 2.82m, Input: 6175868099 rows (0B); per task: avg.: 2058622699.67 std.dev.: 28273112.95, Output: 1488 rows (13.09kB)
    Output layout: [count_3]
    Output partitioning: SINGLE []
    Execution Flow: UNGROUPED_EXECUTION
    - Aggregate(PARTIAL) => [count_3:bigint]
            Cost: {rows: ? (?), cpu: ?, memory: ?, network: 0.00}
            CPU fraction: 5.22%, Output: 1488 rows (13.09kB)
            Input avg.: 4150448.99 rows, Input std.dev.: 45.67%
            count_3 := "count"(*)
        - TableScan[hive:bi:unprod_bids_metrics, originalConstraint = (""date"" BETWEEN CAST('2020-05-01' AS varchar) AND CAST('2020-06-01' AS varchar))] => []
                Cost: {rows: ? (?), cpu: ?, memory: 0.00, network: 0.00}
                CPU fraction: 94.78%, Output: 6175868099 rows (0B)
                Input avg.: 4150448.99 rows, Input std.dev.: 45.67%
                LAYOUT: bi.unprod_bids_metrics
                HiveColumnHandle{name=date, hiveType=string, hiveColumnIndex=-1, columnType=PARTITION_KEY}
                    :: [[2020-05-01], [2020-05-02], [2020-05-03], [2020-05-04], [2020-05-05], [2020-05-06], [2020-05-07], [2020-05-08], [2020-05-09], [2020-05-10], [2020-05-11], [2020-05-12], [2020-05-13], [2020-05-14], [2020-05-15], [2020-05-16], [2020-05-17], [2020-05-18], [2020-05-19], [2020-05-20], [2020-05-21], [2020-05-22], [2020-05-23], [2020-05-24], [2020-05-25], [2020-05-26], [2020-05-27], [2020-05-28], [2020-05-29], [2020-05-30], [2020-05-31], [2020-06-01]]

0.234.2:

Fragment 1 [SINGLE]
    CPU: 705.57ms, Scheduled: 16.81s, Input: 1490 rows (13.10kB); per task: avg.: 1490.00 std.dev.: 0.00, Output: 1 row (9B)
    Output layout: [count]
    Output partitioning: SINGLE []
    Stage Execution Strategy: UNGROUPED_EXECUTION
    - Aggregate(FINAL) => [count:bigint]
            CPU: 83.00ms (0.02%), Scheduled: 2.50s (0.02%), Output: 1 row (9B)
            Input avg.: 1490.00 rows, Input std.dev.: 0.00%
            count := "presto.default.count"((count_3))
        - LocalExchange[SINGLE] () => [count_3:bigint]
                CPU: 89.00ms (0.02%), Scheduled: 2.20s (0.02%), Output: 1490 rows (13.10kB)
                Input avg.: 93.13 rows, Input std.dev.: 222.07%
            - RemoteSource[2] => [count_3:bigint]
                    CPU: 62.00ms (0.01%), Scheduled: 1.03s (0.01%), Output: 1490 rows (13.10kB)
                    Input avg.: 93.13 rows, Input std.dev.: 222.07%

Fragment 2 [SOURCE]
    CPU: 7.78m, Scheduled: 3.97h, Input: 6175868099 rows (0B); per task: avg.: 2058622699.67 std.dev.: 98967421.99, Output: 1490 rows (13.09kB)
    Output layout: [count_3]
    Output partitioning: SINGLE []
    Stage Execution Strategy: UNGROUPED_EXECUTION
    - Aggregate(PARTIAL) => [count_3:bigint]
            CPU: 4.55m (58.54%), Scheduled: 2.53h (63.24%), Output: 1490 rows (13.09kB)
            Input avg.: 4144877.92 rows, Input std.dev.: 45.31%
            count_3 := "presto.default.count"(*)
        - TableScan[TableHandle {connectorId='hive', connectorHandle='HiveTableHandle{schemaName=bi, tableName=unprod_bids_metrics, analyzePartitionValues=Optional.empty}', layout='Optional[bi.unprod_bids_metrics{domains={date=[ [[2020-05-01, 2020-06-01]] ]}}]'}, grouped = false] => []
                CPU: 3.22m (41.41%), Scheduled: 1.47h (36.72%), Output: 6175868099 rows (0B)
                Input avg.: 4144877.92 rows, Input std.dev.: 45.31%
                LAYOUT: bi.unprod_bids_metrics{domains={date=[ [[2020-05-01, 2020-06-01]] ]}}
                date:string:-13:PARTITION_KEY
                    :: [[2020-05-01], [2020-05-02], [2020-05-03], [2020-05-04], [2020-05-05], [2020-05-06], [2020-05-07], [2020-05-08], [2020-05-09], [2020-05-10], [2020-05-11], [2020-05-12], [2020-05-13], [2020-05-14], [2020-05-15], [2020-05-16], [2020-05-17], [2020-05-18], [2020-05-19], [2020-05-20], [2020-05-21], [2020-05-22], [2020-05-23], [2020-05-24], [2020-05-25], [2020-05-26], [2020-05-27], [2020-05-28], [2020-05-29], [2020-05-30], [2020-05-31], [2020-06-01]]
                Input: 6175868099 rows (0B), Filtered: 0.00%

Here is a screenshot of the long-running scheduling part of the table scan (image omitted). Once the scheduling completes, the query completes quickly.
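
For reference, split generation and scheduling in the Hive connector are governed by a handful of catalog properties; the sketch below lists the usual suspects with illustrative values (property names should be double-checked against the Hive connector docs for your Presto version, and in this case the actual fix turned out to be the 0.234.3 upgrade mentioned in the comments below):

# etc/catalog/hive.properties -- knobs that influence how quickly splits are generated and handed to the scheduler
# (values are illustrative, not a recommendation)
hive.split-loader-concurrency=4
hive.max-outstanding-splits=1000
hive.max-initial-splits=200
hive.max-initial-split-size=32MB
hive.max-split-size=64MB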

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 8 (6 by maintainers)

Top GitHub Comments

1 reaction
mbasmanova commented, Jul 10, 2020

@benrifkind Ben, I’m glad we were able to help you resolve this issue.

1 reaction
benrifkind commented, Jul 10, 2020

🤦 Upgrading to 0.234.3 fixed this issue for me. Wish I had tried this before posting an issue.

@mbasmanova Thank you very much for your help!

Read more comments on GitHub.

