Error when using with pd.read_parquet with threading on

Since updating to the latest version, I randomly get errors like this:

Traceback (most recent call last):
  File "print_parquet_metadata.py", line 28, in <module>
    df = pd.read_parquet(args.file, use_threads=True)
  File "/Users/mtrbean/.pyenv/versions/stats3/lib/python3.6/site-packages/pandas/io/parquet.py", line 282, in read_parquet
    return impl.read(path, columns=columns, **kwargs)
  File "/Users/mtrbean/.pyenv/versions/stats3/lib/python3.6/site-packages/pandas/io/parquet.py", line 129, in read
    **kwargs).to_pandas()
  File "/Users/mtrbean/.pyenv/versions/stats3/lib/python3.6/site-packages/pyarrow/parquet.py", line 1216, in read_table
    use_pandas_metadata=use_pandas_metadata)
  File "/Users/mtrbean/.pyenv/versions/stats3/lib/python3.6/site-packages/pyarrow/parquet.py", line 216, in read
    use_threads=use_threads)
  File "pyarrow/_parquet.pyx", line 1086, in pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Unexpected end of stream: Page was smaller (26292) than expected (59582)

I’m pretty sure that the file is not corrupted because sometimes it can be read successfully, and it can also be read when the file is local.

The error is hard to reproduce consistently, but when I read the file in a loop, a pattern emerges.

This succeeds 30 times in a row without a problem:

import pandas as pd

# Single-threaded reads: every iteration succeeds.
f = "s3://<bucket>/<parquet_file>"
for i in range(30):
    pd.read_parquet(f, use_threads=False)

This fails sometimes on the first iteration, sometimes on the second, and has never reached iteration 5:

f = "s3://<bucket>/<parquet_file>"
for i in range(30):
    pd.read_parquet(f, use_threads=True)

I suspect that it has something to do with random access from multiple threads, and that the cache is not handling it correctly.
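To illustrate the kind of problem suspected here, the following is a minimal sketch, not the fsspec or s3fs implementation. It assumes a file-like object whose seek position is shared state, and shows one generic way to make random-access reads safe across threads: perform the seek and the read under a single lock. The class name `LockedReader` and the method `read_at` are hypothetical names used only for this example.

```python
import io
import threading
from concurrent.futures import ThreadPoolExecutor


class LockedReader:
    """Hypothetical wrapper: makes seek+read atomic on a shared file object."""

    def __init__(self, raw):
        self._raw = raw
        self._lock = threading.Lock()

    def read_at(self, offset, length):
        # Without the lock, another thread could move the file position
        # between seek() and read(), so the wrong bytes would be returned.
        with self._lock:
            self._raw.seek(offset)
            return self._raw.read(length)


# In-memory stand-in for a remote file (assumption: any seekable object works).
data = bytes(range(256)) * 64
reader = LockedReader(io.BytesIO(data))

# Many threads request different byte ranges of the same shared object.
ranges = [(i * 100, 50) for i in range(100)]
with ThreadPoolExecutor(max_workers=8) as pool:
    chunks = list(pool.map(lambda r: reader.read_at(*r), ranges))
```

Serializing access this way trades parallelism for correctness. Whether the failure in this issue comes from an unsynchronized seek, from the cache, or from the HTTP layer is not established at this point in the report.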

Finally, here is the debug log, which I hope will help:

DEBUG:urllib3.util.retry:Converted retries value: False -> Retry(total=False, connect=None, read=None, redirect=0, status=None)
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): mybucket.s3.us-west-2.amazonaws.com
DEBUG:urllib3.connectionpool:https://mybucket.s3.us-west-2.amazonaws.com:443 "GET /?list-type=2&prefix=&delimiter=%2F&encoding-type=url HTTP/1.1" 200 None
DEBUG:urllib3.util.retry:Converted retries value: False -> Retry(total=False, connect=None, read=None, redirect=0, status=None)
DEBUG:urllib3.connectionpool:https://mybucket.s3.us-west-2.amazonaws.com:443 "GET /?list-type=2&prefix=temp%2F&delimiter=%2F&encoding-type=url HTTP/1.1" 200 None
DEBUG:fsspec:<File-like object S3FileSystem, mybucket/temp/data.parquet> read: 18606060 - 18671596
DEBUG:s3fs.core:Fetch: mybucket/temp/data.parquet, 18606060-23914476
DEBUG:urllib3.util.retry:Converted retries value: False -> Retry(total=False, connect=None, read=None, redirect=0, status=None)
DEBUG:urllib3.connectionpool:https://mybucket.s3.us-west-2.amazonaws.com:443 "GET /temp/data.parquet HTTP/1.1" 206 65536
DEBUG:fsspec:<File-like object S3FileSystem, mybucket/temp/data.parquet> read: 18176581 - 18670182
DEBUG:s3fs.core:Fetch: mybucket/temp/data.parquet, 18176581-18606060
DEBUG:urllib3.util.retry:Converted retries value: False -> Retry(total=False, connect=None, read=None, redirect=0, status=None)
DEBUG:urllib3.connectionpool:https://mybucket.s3.us-west-2.amazonaws.com:443 "GET /temp/data.parquet HTTP/1.1" 206 429479
DEBUG:fsspec:<File-like object S3FileSystem, mybucket/temp/data.parquet> read: 4 - 7237571
DEBUG:s3fs.core:Fetch: mybucket/temp/data.parquet, 4-12480451
DEBUG:urllib3.util.retry:Converted retries value: False -> Retry(total=False, connect=None, read=None, redirect=0, status=None)
DEBUG:urllib3.connectionpool:https://mybucket.s3.us-west-2.amazonaws.com:443 "GET /temp/data.parquet HTTP/1.1" 206 12480447
DEBUG:fsspec:<File-like object S3FileSystem, mybucket/temp/data.parquet> read: 8274755 - 18176458
DEBUG:s3fs.core:Fetch: mybucket/temp/data.parquet, 8274755-10485764
DEBUG:urllib3.util.retry:Converted retries value: False -> Retry(total=False, connect=None, read=None, redirect=0, status=None)
DEBUG:urllib3.connectionpool:https://mybucket.s3.us-west-2.amazonaws.com:443 "GET /temp/data.parquet HTTP/1.1" 206 2211009
DEBUG:fsspec:<File-like object S3FileSystem, mybucket/temp/data.parquet> read: 7237648 - 8274676
DEBUG:s3fs.core:Fetch: mybucket/temp/data.parquet, 7237648-8274755
DEBUG:urllib3.util.retry:Converted retries value: False -> Retry(total=False, connect=None, read=None, redirect=0, status=None)
DEBUG:urllib3.connectionpool:https://mybucket.s3.us-west-2.amazonaws.com:443 "GET /temp/data.parquet HTTP/1.1" 206 1037107
Traceback (most recent call last):
  File "print_parquet_metadata.py", line 28, in <module>
    df = pd.read_parquet(args.file, use_threads=True)
  File "/Users/mtrbean/.pyenv/versions/stats3/lib/python3.6/site-packages/pandas/io/parquet.py", line 282, in read_parquet
    return impl.read(path, columns=columns, **kwargs)
  File "/Users/mtrbean/.pyenv/versions/stats3/lib/python3.6/site-packages/pandas/io/parquet.py", line 129, in read
    **kwargs).to_pandas()
  File "/Users/mtrbean/.pyenv/versions/stats3/lib/python3.6/site-packages/pyarrow/parquet.py", line 1216, in read_table
    use_pandas_metadata=use_pandas_metadata)
  File "/Users/mtrbean/.pyenv/versions/stats3/lib/python3.6/site-packages/pyarrow/parquet.py", line 216, in read
    use_threads=use_threads)
  File "pyarrow/_parquet.pyx", line 1086, in pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Unexpected end of stream: Page was smaller (26292) than expected (59582)
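One way to read a log like the one above is to pair each `fsspec` read request with the `s3fs` fetch that follows it and compare the byte ranges. The sketch below does this with regular expressions written against the line format shown in this log only; the helper names `pair_reads_with_fetches` and `fetch_covers_read` are hypothetical. It assumes read and fetch lines strictly alternate, which may not hold when log lines from several threads interleave.

```python
import re

# Patterns match the two line shapes that appear in the debug log above.
READ_RE = re.compile(r"DEBUG:fsspec:.*read: (\d+) - (\d+)")
FETCH_RE = re.compile(r"DEBUG:s3fs\.core:Fetch: .*?, (\d+)-(\d+)")


def pair_reads_with_fetches(lines):
    """Return [(read_range, fetch_range), ...] for each read followed by a fetch."""
    pairs = []
    pending = None
    for line in lines:
        m = READ_RE.search(line)
        if m:
            pending = (int(m.group(1)), int(m.group(2)))
            continue
        m = FETCH_RE.search(line)
        if m and pending is not None:
            pairs.append((pending, (int(m.group(1)), int(m.group(2)))))
            pending = None
    return pairs


def fetch_covers_read(read, fetch):
    """True if the fetched byte range fully contains the requested range."""
    return fetch[0] <= read[0] and fetch[1] >= read[1]


# Two read/fetch pairs copied from the log above.
LOG = [
    "DEBUG:fsspec:<File-like object S3FileSystem, mybucket/temp/data.parquet> read: 18606060 - 18671596",
    "DEBUG:s3fs.core:Fetch: mybucket/temp/data.parquet, 18606060-23914476",
    "DEBUG:fsspec:<File-like object S3FileSystem, mybucket/temp/data.parquet> read: 8274755 - 18176458",
    "DEBUG:s3fs.core:Fetch: mybucket/temp/data.parquet, 8274755-10485764",
]

pairs = pair_reads_with_fetches(LOG)
```

A fetch that does not cover its read is not proof of a bug, since the cache may already hold the remaining bytes. It only marks the requests worth a closer look.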

Issue Analytics

Created 4 years ago
Comments: 13 (12 by maintainers)

Top GitHub Comments

martindurant commented, Sep 9, 2019

Apparently boto3 sessions are not thread-safe, but we have not come across this before, so maybe previously they accidentally were. Certainly, s3fs instances get shared, but this is not new behaviour. I do not have a good idea of how to produce thread-local boto sessions.
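For reference, a common generic pattern for per-thread objects in Python is `threading.local`. The sketch below is only an illustration of that pattern under stated assumptions: `PerThreadFactory` and `FakeSession` are hypothetical names, and `FakeSession` stands in for whatever session constructor would be used (for example `boto3.session.Session`). Whether and how s3fs could be wired to use such a factory is not covered in this thread.

```python
import threading


class PerThreadFactory:
    """Hypothetical helper: lazily creates one object per thread."""

    def __init__(self, make_session):
        self._make_session = make_session  # zero-argument callable
        self._local = threading.local()    # attributes are private to each thread

    def get(self):
        session = getattr(self._local, "session", None)
        if session is None:
            # First call on this thread: build and remember a fresh object.
            session = self._make_session()
            self._local.session = session
        return session


class FakeSession:
    """Stand-in for a real session class (assumption for this example)."""


sessions = PerThreadFactory(FakeSession)

# Repeated calls on the same thread reuse one object.
main_session = sessions.get()

# A different thread receives its own, separate object.
seen_in_worker = []
worker = threading.Thread(target=lambda: seen_in_worker.append(sessions.get()))
worker.start()
worker.join()
```

With this pattern, each worker thread pays the cost of creating its own object once, and no object is ever shared across threads.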

mtrbean commented, Aug 7, 2019

I currently work around it by turning off threading in pyarrow, but I think the best way is to turn off caching. What is the best way to turn off fsspec’s caching? @martindurant
