OverflowError: signed integer is greater than maximum on large Pandas Queries

Hello, I've been using the PandasCursor on Python 3.7 and Pandas 1.0.3. I upgraded to Python 3.8.2 and PyAthena 1.10.5 today, and some of my queries began to fail. The failing queries appear to be the ones on the largest data sets (38M records; 2GB of data in S3).

Here’s the call I’m making:

results = cursor.execute('SELECT col1, col2, col3 FROM table')

Where all three columns are string types in Glue/Athena.

This is the error that comes back:

---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
<ipython-input-18-08c66998c34b> in <module>
----> 1 results = cursor.execute('SELECT col1, col2, col3 FROM table')

/usr/local/lib/python3.8/site-packages/pyathena/util.py in _wrapper(*args, **kwargs)
    240     def _wrapper(*args, **kwargs):
    241         with _lock:
--> 242             return wrapped(*args, **kwargs)
    243     return _wrapper
    244 

/usr/local/lib/python3.8/site-packages/pyathena/pandas_cursor.py in execute(self, operation, parameters, work_group, s3_staging_dir, cache_size)
     53         query_execution = self._poll(self._query_id)
     54         if query_execution.state == AthenaQueryExecution.STATE_SUCCEEDED:
---> 55             self._result_set = AthenaPandasResultSet(
     56                 self._connection, self._converter, query_execution, self.arraysize,
     57                 self._retry_config)

/usr/local/lib/python3.8/site-packages/pyathena/result_set.py in __init__(self, connection, converter, query_execution, arraysize, retry_config)
    358         if self.state == AthenaQueryExecution.STATE_SUCCEEDED and \
    359                 self.output_location.endswith(('.csv', '.txt')):
--> 360             self._df = self._as_pandas()
    361         else:
    362             import pandas as pd

/usr/local/lib/python3.8/site-packages/pyathena/result_set.py in _as_pandas(self)
    449                     header = 0
    450                     names = None
--> 451                 df = pd.read_csv(io.BytesIO(response['Body'].read()),
    452                                  sep=sep,
    453                                  header=header,

/usr/local/lib/python3.8/site-packages/botocore/response.py in read(self, amt)
     76         """
     77         try:
---> 78             chunk = self._raw_stream.read(amt)
     79         except URLLib3ReadTimeoutError as e:
     80             # TODO: the url will be None as urllib3 isn't setting it yet

/usr/local/lib/python3.8/site-packages/urllib3/response.py in read(self, amt, decode_content, cache_content)
    513             if amt is None:
    514                 # cStringIO doesn't like amt=None
--> 515                 data = self._fp.read() if not fp_closed else b""
    516                 flush_decoder = True
    517             else:

/usr/local/Cellar/python@3.8/3.8.2/Frameworks/Python.framework/Versions/3.8/lib/python3.8/http/client.py in read(self, amt)
    465             else:
    466                 try:
--> 467                     s = self._safe_read(self.length)
    468                 except IncompleteRead:
    469                     self._close_conn()

/usr/local/Cellar/python@3.8/3.8.2/Frameworks/Python.framework/Versions/3.8/lib/python3.8/http/client.py in _safe_read(self, amt)
    606         IncompleteRead exception can be used to detect the problem.
    607         """
--> 608         data = self.fp.read(amt)
    609         if len(data) < amt:
    610             raise IncompleteRead(data, amt-len(data))

/usr/local/Cellar/python@3.8/3.8.2/Frameworks/Python.framework/Versions/3.8/lib/python3.8/socket.py in readinto(self, b)
    667         while True:
    668             try:
--> 669                 return self._sock.recv_into(b)
    670             except timeout:
    671                 self._timeout_occurred = True

/usr/local/Cellar/python@3.8/3.8.2/Frameworks/Python.framework/Versions/3.8/lib/python3.8/ssl.py in recv_into(self, buffer, nbytes, flags)
   1239                   "non-zero flags not allowed in calls to recv_into() on %s" %
   1240                   self.__class__)
-> 1241             return self.read(nbytes, buffer)
   1242         else:
   1243             return super().recv_into(buffer, nbytes, flags)

/usr/local/Cellar/python@3.8/3.8.2/Frameworks/Python.framework/Versions/3.8/lib/python3.8/ssl.py in read(self, len, buffer)
   1097         try:
   1098             if buffer is not None:
-> 1099                 return self._sslobj.read(len, buffer)
   1100             else:
   1101                 return self._sslobj.read(len)

OverflowError: signed integer is greater than maximum

This is my first time reporting something like this, so please let me know what other information (and how to gather it if it isn’t obvious) you would need to diagnose the issue. Thank you!

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Reactions:1
  • Comments:7 (2 by maintainers)

Top GitHub Comments

3 reactions
KneeShard commented, May 12, 2020

I have recently encountered this behaviour, and I think that there’s a simple workaround in result_set.py which makes pandas read the response in chunks rather than all in one go:

diff --git a/pyathena/result_set.py b/pyathena/result_set.py
index b50a358..8dcf022 100644
--- a/pyathena/result_set.py
+++ b/pyathena/result_set.py
@@ -529,7 +529,7 @@ class AthenaPandasResultSet(AthenaResultSet):
                     header = 0
                     names = None
                 df = pd.read_csv(
-                    io.BytesIO(response["Body"].read()),
+                    response["Body"],
                     sep=sep,
                     header=header,
                     names=names,

Python 3.8.2 (64 bit) on Linux, PyAthena==1.10.5, pandas==1.0.3
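
The diff above hands the streaming body straight to pandas. If a fully buffered copy is still wanted, a variation on the same idea is to drain the stream with bounded reads, so that no single read asks the SSL layer for more bytes than a signed 32-bit integer can hold (which appears to be what triggers the OverflowError in the traceback). This is only a minimal sketch, assuming `body` is any file-like object with a `read(amt)` method, as botocore's `StreamingBody` has. The names `read_body_in_chunks` and `CHUNK_SIZE` are hypothetical and not part of PyAthena.

```python
import io

# Assumed chunk size: 64 MiB, far below the 2**31 - 1 byte limit.
CHUNK_SIZE = 64 * 1024 * 1024


def read_body_in_chunks(body, chunk_size=CHUNK_SIZE):
    """Copy a file-like object into a BytesIO using bounded-size reads."""
    buffer = io.BytesIO()
    while True:
        chunk = body.read(chunk_size)
        if not chunk:  # an empty bytes object signals end of stream
            break
        buffer.write(chunk)
    buffer.seek(0)  # rewind so a consumer can read from the start
    return buffer


# Stand-in for response["Body"]; a tiny chunk size forces several reads.
fake_body = io.BytesIO(b"col1,col2,col3\na,b,c\nd,e,f\n")
buffered = read_body_in_chunks(fake_body, chunk_size=8)

# In result_set.py the buffer could then replace the single large read, e.g.
# pd.read_csv(read_body_in_chunks(response["Body"]), sep=sep, header=header, names=names)
```

Unlike the one-line diff, this still holds the whole result in memory, so it only avoids the oversized read and does not reduce peak memory use.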

0 reactions
juviasuisei commented, May 13, 2020

ty!

Top Results From Across the Web

Overflowerror when reading from s3 - signed integer is greater ...
If I'm reading from S3 and trying to load the file into Lambda's memory (which is sufficiently large enough to hold the data),...

`OverflowError: signed integer is greater than maximum` in ssl ...
When attempting to read a large file (> 2GB) over HTTPS the read fails with "OverflowError: signed integer is greater than maximum".

OverflowError: signed integer is greater than maximum
OverflowError: signed integer is greater than maximum. Believe it should be problem with the 64 bits OS. Tried python 2.6.1 and 2.6.6...

Pandas dtype issue: converting number to str - signed integer ...
df_web = pd.read_csv('web_oh.csv',dtype=str) traceback: OverflowError: signed integer is greater than maximum. example data:

12372 (OverflowError: signed integer is greater than maximum)
Branch: Release Notes: Fix parse_date() raising OverflowError for large integer part. API Changes:.
