
Fetching large result sets can be sped up massively using S3


Hello, thanks for such a great library; it has really made working with Athena from Python easy.

I download a lot of large results from Athena queries (in the gigabyte range). Unfortunately, using PyAthena for this is very (very) slow – hundreds of times slower than just downloading the results from S3. This is because it is fetching and converting a few rows at a time via the Athena API.

I have taken to working around it with the following approach. Basically, execute the query, get the output location, then fetch with S3 (and convert to Pandas in my case).

import pandas as pd
import pyathena
from smart_open import smart_open  # pandas can't easily read from S3 using a role/profile

athena_cursor = pyathena.connect(...).cursor()

def query(sql, profile=None):
    """:Return: a Pandas DataFrame of results from a `sql` query executed against AWS Athena."""
    athena_cursor.execute(sql)
    # Read the result CSV straight from its S3 output location -- MUCH faster
    # than PyAthena fetching a few rows at a time via the API.
    return pd.read_csv(smart_open(athena_cursor.output_location, profile_name=profile))

(I’m using smart_open to make fetching from S3 easy, but of course the same thing can be accomplished with just boto.)
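For reference, a minimal sketch of the boto-based variant, assuming boto3 and that output_location is a plain s3:// URI; the bucket/key parsing and the connect() arguments are placeholders rather than part of the original snippet:

import io
from urllib.parse import urlparse

import boto3
import pandas as pd
import pyathena

athena_cursor = pyathena.connect(...).cursor()  # same connection as above

def query_with_boto3(sql, profile=None):
    """:Return: a Pandas DataFrame of Athena results, downloaded from S3 with boto3."""
    athena_cursor.execute(sql)
    # output_location looks like s3://bucket/prefix/<query-id>.csv
    parsed = urlparse(athena_cursor.output_location)
    s3 = boto3.Session(profile_name=profile).client("s3")
    body = s3.get_object(Bucket=parsed.netloc, Key=parsed.path.lstrip("/"))["Body"]
    return pd.read_csv(io.BytesIO(body.read()))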

Could something like this be incorporated directly into PyAthena? Perhaps connect could have an option fetch_with_s3=True or similar. You’d probably still need to do type conversion, but there could be a fast path for as_pandas() to let Pandas do it all in one go.
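To make the suggestion concrete, here is a rough, hypothetical sketch of such a fast path. The helper name, the type strings read from cursor.description, and the dtype mapping are illustrative assumptions, not anything PyAthena actually ships:

import pandas as pd
from smart_open import smart_open

# Assumed mapping from Athena column types to pandas dtypes (illustration only).
_DTYPE_MAP = {
    "tinyint": "Int64", "smallint": "Int64", "integer": "Int64", "bigint": "Int64",
    "float": "float64", "double": "float64", "boolean": "boolean", "varchar": "object",
}

def as_pandas_via_s3(cursor, profile=None):
    """Hypothetical fetch_with_s3-style helper: bulk-download the result CSV from S3."""
    names = [col[0] for col in cursor.description]
    dtypes = {col[0]: _DTYPE_MAP[col[1]] for col in cursor.description if col[1] in _DTYPE_MAP}
    parse_dates = [col[0] for col in cursor.description if col[1] in ("date", "timestamp")]
    return pd.read_csv(
        smart_open(cursor.output_location, profile_name=profile),
        names=names, header=0, dtype=dtypes, parse_dates=parse_dates,
    )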

Just a thought; thanks again!

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments:11 (6 by maintainers)

Top GitHub Comments

2 reactions
laughingman7743 commented, Sep 22, 2018

I checked the performance. https://gist.github.com/laughingman7743/2e4d83ca4e394dc645e9ea9a45fe78ba PandasCursor is ultra fast. 😆
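For anyone landing here later, a short usage sketch of the PandasCursor referenced above, assuming a recent PyAthena release (older releases imported it from pyathena.pandas_cursor) and placeholder bucket, region, and table names:

from pyathena import connect
from pyathena.pandas.cursor import PandasCursor  # older releases: from pyathena.pandas_cursor import PandasCursor

cursor = connect(
    s3_staging_dir="s3://your-bucket/athena-staging/",  # placeholder
    region_name="us-east-1",                            # placeholder
    cursor_class=PandasCursor,
).cursor()

# The cursor downloads the CSV result object from S3 and parses it with pandas
# in one pass instead of paging rows through the Athena API.
df = cursor.execute("SELECT * FROM your_table").as_pandas()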

2 reactions
jkleint commented, Sep 22, 2018

Yeah, it’s working very well! Thank you so much. Will test a bit more next week.
