read_parquet() - Error - Couldn't deserialize thrift
Hello,
I upgraded to wrangler v1.9.0 and have been receiving errors when trying to read some AWS DMS files. So far, this has only happened on some (not all) of the full/initial load files, which are normally the larger files. I had to reinitialize 5 tables with my DMS task, and 3 of the 5 have this problem with their full load files. I tried reading the problematic files with pandas, as well as with wrangler v1.8.1, and both worked.
Try with v1.8.1, works.
!pip install awswrangler==1.8.1
# Restart kernel
import awswrangler as wr
key = 's3://bucket/path/file.parquet'
df = wr.s3.read_parquet(key, safe=False) #Success
cols = wr.s3.read_parquet_metadata(key) #Success
Try with v1.9.0, fails.
!pip install awswrangler==1.9.0
# Restart kernel
import awswrangler as wr
key = 's3://bucket/path/file.parquet'
df = wr.s3.read_parquet(key, safe=False) #Fails - Errors below
cols = wr.s3.read_parquet_metadata(key) #Success
All of the files can be read with pandas without issue as well.
import pandas as pd
key = 's3://bucket/path/file.parquet'
df = pd.read_parquet(key, engine='pyarrow') #Success
Errors I’ve gotten from various files:
- Couldn't deserialize thrift: TProtocolException: Invalid data Deserializing page header failed.
- Couldn't deserialize thrift: No more data to read. Deserializing page header failed.
- Couldn't deserialize thrift: don't know what type: Deserializing page header failed.
Any ideas? Thanks for your help and time!
Jarret
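Since only some of the files fail, it may help to record which keys fail and with which message before digging further. The following is a minimal sketch, not part of the original report: `triage_reads` is a hypothetical helper name, and the reader is passed in as a callable so the same loop can be run against any wrangler version or against pandas.

```python
from typing import Callable, Dict, Iterable, Optional


def triage_reads(
    keys: Iterable[str],
    reader: Callable[[str], object],
) -> Dict[str, Optional[str]]:
    """Try to read every key; map it to None on success or to the error text on failure.

    `triage_reads` is a hypothetical helper for illustration only.
    """
    results: Dict[str, Optional[str]] = {}
    for key in keys:
        try:
            reader(key)
            results[key] = None
        except Exception as exc:  # broad on purpose: any failure should be recorded
            results[key] = f"{type(exc).__name__}: {exc}"
    return results


# Possible usage with the call from the report (assumes awswrangler is installed
# and `keys` is a list of S3 paths you supply):
#   import awswrangler as wr
#   report = triage_reads(keys, lambda k: wr.s3.read_parquet(k, safe=False))
#   failed = {k: msg for k, msg in report.items() if msg is not None}
```

Comparing the failing keys against file size is one way to check whether the problem really tracks the larger full-load files.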
Issue Analytics
- Created 3 years ago
- Comments:11 (7 by maintainers)
I had a similar issue, and reading the parquet file with the fastparquet engine solved it. df = pd.read_parquet('data.parquet', engine='fastparquet')
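The engine-switch workaround can be generalized into a fallback chain that tries one reader after another. This is only a sketch under assumptions: `read_with_fallback` is a hypothetical helper, the readers are plain callables you supply, and it works around the symptom without explaining the underlying page header error.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def read_with_fallback(
    path: str,
    readers: Sequence[Tuple[str, Callable[[str], T]]],
) -> Tuple[str, T]:
    """Return (reader_name, result) from the first reader that succeeds.

    `read_with_fallback` is a hypothetical helper for illustration only.
    """
    errors: List[str] = []
    for name, reader in readers:
        try:
            return name, reader(path)
        except Exception as exc:  # keep the message so it is not lost on fallback
            errors.append(f"{name}: {exc}")
    raise RuntimeError(f"all readers failed for {path}; " + "; ".join(errors))


# Possible usage (assumes pandas plus both engines are installed):
#   import pandas as pd
#   engine, df = read_with_fallback("data.parquet", [
#       ("pyarrow", lambda p: pd.read_parquet(p, engine="pyarrow")),
#       ("fastparquet", lambda p: pd.read_parquet(p, engine="fastparquet")),
#   ])
```

Returning the name of the reader that succeeded makes it easy to log how often the fallback is actually needed.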
Hello Igor,
Yes, the files from the 3 failing tables range from 79MB to 256MB and are the largest I have loaded since moving to v1.9.0. For the other 2 tables, the files that were read successfully are all under 200KB.
Running from a SageMaker notebook, here is what is returned. Is this sufficient?