Fetching large result sets can be sped up massively using S3

Hello, thanks for such a great library; it has really made working with Athena from Python easy.

I download a lot of large results from Athena queries (in the gigabyte range). Unfortunately, using PyAthena for this is very (very) slow – hundreds of times slower than just downloading the results from S3. This is because it is fetching and converting a few rows at a time via the Athena API.

I have taken to working around it with the following approach. Basically, execute the query, get the output location, then fetch with S3 (and convert to Pandas in my case).

import pandas as pd
import pyathena
from smart_open import smart_open  # pandas can't easily read from S3 using a role/profile

athena_cursor = pyathena.connect(...).cursor()

def query(sql, profile=None):
    """:Return: a Pandas DataFrame of results from a `sql` query executed against AWS Athena."""
    # Run the query; Athena writes the full result set as a CSV to S3
    athena_cursor.execute(sql)
    # MUCH faster than PyAthena reading a few rows at a time via the API
    return pd.read_csv(smart_open(athena_cursor.output_location, profile_name=profile))

(I’m using smart_open to make fetching from S3 easy, but of course the same thing can be accomplished with just boto.)
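For anyone who wants the boto route, here is a rough sketch of what that might look like. The helper names `split_s3_uri` and `open_result` are made up for illustration and are not part of PyAthena or boto; the client is passed in, so anything with a boto3-style `get_object(Bucket=..., Key=...)` should work.

```python
import io
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/path/to/key.csv' into ('bucket', 'path/to/key.csv')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def open_result(s3_client, output_location):
    """Download the query's result CSV and return it as an in-memory binary file.

    `s3_client` is assumed to expose a boto3-style get_object(Bucket=..., Key=...).
    The whole object is buffered in memory, which is acceptable here only because
    the DataFrame built from it would need that much memory anyway.
    """
    bucket, key = split_s3_uri(output_location)
    response = s3_client.get_object(Bucket=bucket, Key=key)
    return io.BytesIO(response["Body"].read())
```

With boto3 and pandas installed, this would presumably be used as `pd.read_csv(open_result(boto3.Session(profile_name=profile).client("s3"), athena_cursor.output_location))`.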

Could something like this be incorporated directly into PyAthena? Perhaps connect could have an option fetch_with_s3=True or similar. You’d probably still need to do type conversion, but there could be a fast path for as_pandas() to let Pandas do it all in one go.
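To make the type conversion idea concrete, here is a minimal sketch of how column types could be handed to Pandas up front so it converts everything in one pass. It assumes each entry of `cursor.description` starts with `(name, athena_type, ...)` and that the type is a lowercase Athena type name. The mapping is partial and illustrative, and `read_csv_kwargs` is a hypothetical helper, not an existing PyAthena function.

```python
# Illustrative, partial mapping from Athena type names to pandas dtypes.
# "Int64" is pandas' nullable integer dtype, so NULLs in integer columns survive.
ATHENA_TO_PANDAS = {
    "tinyint": "Int64",
    "smallint": "Int64",
    "integer": "Int64",
    "bigint": "Int64",
    "float": "float64",
    "double": "float64",
    "varchar": "object",
    "char": "object",
}

# Columns of these types are left to pandas' date parser instead of a dtype.
DATE_TYPES = {"date", "timestamp"}


def read_csv_kwargs(description):
    """Build `dtype` and `parse_dates` arguments for pandas.read_csv.

    `description` is assumed to be a sequence of tuples whose first two items
    are the column name and the Athena type name. Unknown types are skipped,
    so pandas falls back to its own inference for those columns.
    """
    dtype, parse_dates = {}, []
    for name, athena_type, *_ in description:
        if athena_type in DATE_TYPES:
            parse_dates.append(name)
        elif athena_type in ATHENA_TO_PANDAS:
            dtype[name] = ATHENA_TO_PANDAS[athena_type]
    return {"dtype": dtype, "parse_dates": parse_dates}
```

The result could then be splatted into the read, e.g. `pd.read_csv(f, **read_csv_kwargs(athena_cursor.description))`.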

Just a thought; thanks again!

Issue Analytics

Created 5 years ago
Comments: 11 (6 by maintainers)

Top GitHub Comments

laughingman7743 commented, Sep 22, 2018

I checked the performance. https://gist.github.com/laughingman7743/2e4d83ca4e394dc645e9ea9a45fe78ba PandasCursor is ultra fast. 😆

jkleint commented, Sep 22, 2018

Yeah, it’s working very well! Thank you so much. Will test a bit more next week.
