Fetching large result sets can be sped up massively using S3
Hello, thanks for such a great library; it has really made working with Athena from Python easy.
I download a lot of large results from Athena queries (in the gigabyte range). Unfortunately, using PyAthena for this is very (very) slow – hundreds of times slower than just downloading the results from S3. This is because it is fetching and converting a few rows at a time via the Athena API.
I have taken to working around it with the following approach: execute the query, get the output location, then fetch the result file directly from S3 (and convert it to Pandas, in my case).
```python
import pandas as pd
import pyathena
from smart_open import smart_open  # pandas can't easily read from S3 using a role/profile

athena_cursor = pyathena.connect(...).cursor()

def query(sql, profile=None):
    """:Return: a Pandas DataFrame of results from a `sql` query executed against AWS Athena."""
    athena_cursor.execute(sql)
    # MUCH faster than PyAthena reading a few rows at a time via the API
    return pd.read_csv(smart_open(athena_cursor.output_location, profile_name=profile))
```
(I’m using smart_open to make fetching from S3 easy, but of course the same thing can be accomplished with just boto.)
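For reference, a plain-boto3 version of that fetch might look roughly like the sketch below; the bucket/key parsing and the helper name are my own illustration, not part of PyAthena:

```python
import boto3
import pandas as pd


def read_athena_output(output_location, profile=None):
    """Read the CSV result object that Athena wrote to S3 into a DataFrame."""
    bucket, key = output_location.replace("s3://", "").split("/", 1)
    s3 = boto3.Session(profile_name=profile).client("s3")
    obj = s3.get_object(Bucket=bucket, Key=key)
    # botocore's StreamingBody is file-like, so pandas can parse it directly
    return pd.read_csv(obj["Body"])
```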
Could something like this be incorporated directly into PyAthena? Perhaps `connect` could have an option `fetch_with_s3=True` or similar. You'd probably still need to do type conversion, but there could be a fast path for `as_pandas()` to let Pandas do it all in one go.
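To make the idea concrete, usage might look something like this; `pyathena.connect` and `pyathena.util.as_pandas` are the existing API, the `fetch_with_s3` flag is purely hypothetical, and the staging dir, region, and query are placeholder values:

```python
import pyathena
from pyathena.util import as_pandas

cursor = pyathena.connect(
    s3_staging_dir="s3://my-bucket/athena-results/",  # placeholder
    region_name="us-east-1",                          # placeholder
    # fetch_with_s3=True,  # hypothetical opt-in flag proposed above
).cursor()

cursor.execute("SELECT * FROM my_table")  # placeholder query
df = as_pandas(cursor)  # the proposed fast path would read the CSV straight from S3 here
```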
Just a thought; thanks again!
Issue Analytics
- State:
- Created 5 years ago
- Comments: 11 (6 by maintainers)
Top GitHub Comments
I checked the performance. https://gist.github.com/laughingman7743/2e4d83ca4e394dc645e9ea9a45fe78ba PandasCursor is ultra fast. 😆
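For context, PandasCursor is used roughly as in the sketch below; the import path is `pyathena.pandas.cursor` in recent releases (`pyathena.pandas_cursor` in older ones), and the staging dir, region, and query are placeholders:

```python
from pyathena import connect
from pyathena.pandas.cursor import PandasCursor  # pyathena.pandas_cursor in older releases

cursor = connect(
    s3_staging_dir="s3://my-bucket/athena-results/",  # placeholder
    region_name="us-east-1",                          # placeholder
    cursor_class=PandasCursor,
).cursor()

# PandasCursor downloads the CSV result object from S3 and parses it with pandas
# in one go, instead of paging rows through the Athena API.
df = cursor.execute("SELECT * FROM my_table").as_pandas()  # placeholder query
```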
Yeah, it’s working very well! Thank you so much. I’ll test a bit more next week.