Query performance: 1 minute in Hue Impala vs. 2 hours in Python fetchall()

In Hue, my Impala query runs in less than 1 minute, but exactly the same query run through impyla takes more than 2 hours. The Python script runs on the same machine as the Impala daemon. The query is a simple "SELECT * FROM my_table WHERE col1 = x;". The data is stored as Parquet and partitioned by "col1".

Cluster load is high only during the first half minute after the Python script starts. The query result is about 1 GB, but the memory usage of the Python script grows continuously from a few hundred MB to about 15 GB.

In the Python code, cursor.execute(sql_query) finishes in less than 20 seconds (sql_query is the query above), but res = cursor.fetchall() runs for ~2 hours.

Does anybody know what the problem could be? Thank you!

Issue Analytics

  • State: open
  • Created: 7 years ago
  • Reactions: 5
  • Comments: 9 (8 by maintainers)

Top GitHub Comments

1 reaction
schaffino commented, Nov 8, 2016

@hikoz identified an issue with NULL values that I inadvertently caused by indexing from the back of the null bit array, combined with some additional padding that is added by the fix for HUE-2722, "HiveServer2 sometimes does not add trailing".

I have just changed the code to index from the front to avoid the issue and submitted the pull request above.

The query below previously returned incorrect results and now works.

select stack(2, NULL, NULL, NULL, 1.1, '1', 1) as (a, b, c)
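To illustrate why trailing padding can break back-indexing of a null bitmap, here is a hypothetical sketch. It is not impyla's actual code: the helper names, the least-significant-bit-first layout, and the way the back-indexed lookup is computed are all assumptions made for the example.

```python
def unpack_bits(nulls: bytes) -> list:
    """Expand a bitmap into booleans, least significant bit first (assumed layout)."""
    return [bool((byte >> bit) & 1) for byte in nulls for bit in range(8)]


def is_null_from_front(nulls: bytes, row: int) -> bool:
    """Look up a row's NULL flag by counting from the start of the bitmap."""
    bits = unpack_bits(nulls)
    return row < len(bits) and bits[row]


def is_null_from_back(nulls: bytes, row: int, n_rows: int) -> bool:
    """Hypothetical back-indexed lookup: correct only if the bitmap has no extra bytes."""
    bits = unpack_bits(nulls)
    expected_bits = 8 * ((n_rows + 7) // 8)  # bitmap size if nothing was padded
    return bits[row - expected_bits]         # negative index, counted from the end


exact = bytes([0b00000101])  # rows 0 and 2 are NULL, row 1 is not
padded = exact + b"\x00"     # same flags plus one trailing padding byte

front_exact = [is_null_from_front(exact, r) for r in range(3)]
front_padded = [is_null_from_front(padded, r) for r in range(3)]
back_exact = [is_null_from_back(exact, r, 3) for r in range(3)]
back_padded = [is_null_from_back(padded, r, 3) for r in range(3)]

print(front_exact, front_padded)  # identical: padding does not matter
print(back_exact, back_padded)    # differ: padding shifts the back-indexed lookup
```

Indexing from the front only depends on the row position, so extra bytes at the end of the bitmap are simply never read.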

0 reactions
schaffino commented, Nov 3, 2016

I'm not particularly familiar with Anaconda, but I would try a conda uninstall and then a pip uninstall of impyla, as that can remove leftovers from mixing manual installs with packaged installs. You can also modify setup.py to use a version you can easily identify with pip.

As for your memory issues, it sounds like the date objects are taking a fair amount of space. Have you tried the convert_types=False option when creating the cursor?

You can download guppy and run the code below after pulling the dataset to see which object types are taking the most memory.

from guppy import hpy
h = hpy()
# Print a summary of live objects grouped by type; print() works on Python 2 and 3
print(h.heap())
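If installing guppy is not an option, the standard library's tracemalloc module gives a related view. This is a minimal sketch, not a drop-in replacement: it reports memory by allocation site rather than by object type, and the list comprehension is only a stand-in for the rows returned by cursor.fetchall().

```python
import tracemalloc

tracemalloc.start()

# Stand-in for res = cursor.fetchall(): a list of Python tuples
rows = [(i, str(i), float(i)) for i in range(100_000)]

snapshot = tracemalloc.take_snapshot()
current, peak = tracemalloc.get_traced_memory()  # bytes currently traced, and the peak
tracemalloc.stop()

# Group traced allocations by source line and show the largest ones
top_stats = snapshot.statistics("lineno")
for stat in top_stats[:5]:
    print(stat)

print(f"current={current / 1e6:.1f} MB, peak={peak / 1e6:.1f} MB")
```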

Native Python object types and data structures aren't particularly memory efficient. What do you want to do with the whole dataset in memory? You could look at going directly to a more compact in-memory format.
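As a rough sketch of that idea, the rows can be pulled in batches with the DB-API fetchmany() call and appended to a typed array instead of being held as a list of tuples. Here sqlite3 stands in for Impala so the example runs anywhere; the table contents, the iter_batches helper, and the batch size are assumptions for illustration. With impyla the cursor would come from its own connection, and its parameter style differs from sqlite3's.

```python
import sqlite3
import sys
from array import array


def iter_batches(cursor, batch_size=10_000):
    """Yield lists of rows from an already-executed DB-API cursor."""
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        yield batch


# sqlite3 is only a stand-in for the real Impala connection
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (col1 INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?)",
    ((i % 10, i * 0.5) for i in range(50_000)),
)

cursor = conn.cursor()
cursor.execute("SELECT value FROM my_table WHERE col1 = ?", (3,))

values = array("d")  # packed C doubles, 8 bytes per value
n_batches = 0
for batch in iter_batches(cursor, batch_size=1_000):
    values.extend(row[0] for row in batch)
    n_batches += 1

# Compare against the same values held as a list of Python float objects
as_list = list(values)
list_bytes = sys.getsizeof(as_list) + sum(sys.getsizeof(v) for v in as_list)
array_bytes = sys.getsizeof(values)
print(len(values), n_batches, array_bytes, list_bytes)

conn.close()
```

Only one batch of row tuples is alive at a time, so peak memory is driven by the compact array rather than by the full result set as Python objects.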
