
read_parquet() - Error - Couldn't deserialize thrift

See original GitHub issue

Hello,

I upgraded to wrangler v1.9.0 and have been receiving errors when trying to read in some AWS DMS files. So far, the below behavior has only happened on some (not all) of the full/initial load files, which are normally larger files. I had to reinitialize 5 tables with my DMS task and 3 of the 5 have this problem with the full load files. I tried to read these problematic files in with pandas, as well as, wrangler v1.8.1, and they all worked.

Trying with v1.8.1 works.

!pip install awswrangler==1.8.1
# Restart kernel
import awswrangler as wr
key = 's3://bucket/path/file.parquet'
df = wr.s3.read_parquet(key, safe=False) #Success
cols = wr.s3.read_parquet_metadata(key) #Success

Trying with v1.9.0 fails.

!pip install awswrangler==1.9.0
# Restart kernel
import awswrangler as wr
key = 's3://bucket/path/file.parquet'
df = wr.s3.read_parquet(key, safe=False) #Fails - Errors below
cols = wr.s3.read_parquet_metadata(key) #Success

All of the files can also be read with pandas without issue.

import pandas as pd
key = 's3://bucket/path/file.parquet'
df = pd.read_parquet(key, engine='pyarrow') #Success

Errors I’ve gotten from various files:

  • Couldn’t deserialize thrift: TProtocolException: Invalid data Deserializing page header failed.

  • Couldn’t deserialize thrift: No more data to read. Deserializing page header failed.

  • Couldn’t deserialize thrift: don’t know what type: Deserializing page header failed.

Any ideas? Thanks for your help and time!

Jarret
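
For anyone trying to narrow down where these errors come from, the sketch below (not part of the original issue) reads each row group of the file individually with pyarrow, which can show whether a single row group or the whole file is affected. It assumes s3fs is installed and uses a placeholder key:

import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem()  # uses the default AWS credential chain
key = 's3://bucket/path/file.parquet'  # placeholder - replace with a real object

with fs.open(key, 'rb') as f:
    pf = pq.ParquetFile(f)
    print('row groups:', pf.num_row_groups)
    for i in range(pf.num_row_groups):
        try:
            pf.read_row_group(i)  # read one row group at a time
            print('row group', i, 'OK')
        except OSError as exc:
            print('row group', i, 'failed:', exc)  # e.g. "Couldn't deserialize thrift: ..."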

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 11 (7 by maintainers)

Top GitHub Comments

2 reactions
eefro commented, Mar 20, 2022

I had a similar issue, and reading the parquet file with the fastparquet engine solved it:

df = pd.read_parquet('data.parquet', engine='fastparquet')
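
As a rough usage sketch of that workaround applied to the S3 object from this issue (assuming fastparquet and s3fs are installed; the key is a placeholder), it would look like this:

import pandas as pd

# pandas reads s3:// paths through s3fs; engine='fastparquet' swaps out pyarrow
key = 's3://bucket/path/file.parquet'  # placeholder path
df = pd.read_parquet(key, engine='fastparquet')
print(df.shape)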

1 reaction
jarretg commented, Sep 3, 2020

Hello Igor,

Yes, the files from 3 of the 5 tables range from 79MB to 256MB and are the largest files since I implemented v1.9.0. The files from the other 2 tables, which were read successfully, are under 200KB.

Running from a SageMaker notebook, here is what is returned. Is this sufficient?

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-3-66bc46c2e087> in <module>
----> 1 df_aws = wr.s3.read_parquet(key, safe=False)

~/anaconda3/envs/python3/lib/python3.6/site-packages/awswrangler/s3/_read_parquet.py in read_parquet(path, path_suffix, path_ignore_suffix, partition_filter, columns, validate_schema, chunked, dataset, categories, safe, use_threads, last_modified_begin, last_modified_end, boto3_session, s3_additional_kwargs)
    596         return _read_parquet_chunked(paths=paths, chunked=chunked, validate_schema=validate_schema, **args)
    597     if len(paths) == 1:
--> 598         return _read_parquet(path=paths[0], **args)
    599     if validate_schema is True:
    600         _validate_schemas_from_files(

~/anaconda3/envs/python3/lib/python3.6/site-packages/awswrangler/s3/_read_parquet.py in _read_parquet(path, columns, categories, safe, boto3_session, dataset, path_root, s3_additional_kwargs, use_threads)
    413                     itertools.repeat(_utils.boto3_to_primitives(boto3_session=boto3_session)),
    414                     itertools.repeat(s3_additional_kwargs),
--> 415                     itertools.repeat(use_threads),
    416                 )
    417             )

~/anaconda3/envs/python3/lib/python3.6/concurrent/futures/_base.py in result_iterator()
    584                     # Careful not to keep a reference to the popped future
    585                     if timeout is None:
--> 586                         yield fs.pop().result()
    587                     else:
    588                         yield fs.pop().result(end_time - time.monotonic())

~/anaconda3/envs/python3/lib/python3.6/concurrent/futures/_base.py in result(self, timeout)
    430                 raise CancelledError()
    431             elif self._state == FINISHED:
--> 432                 return self.__get_result()
    433             else:
    434                 raise TimeoutError()

~/anaconda3/envs/python3/lib/python3.6/concurrent/futures/_base.py in __get_result(self)
    382     def __get_result(self):
    383         if self._exception:
--> 384             raise self._exception
    385         else:
    386             return self._result

~/anaconda3/envs/python3/lib/python3.6/concurrent/futures/thread.py in run(self)
     54 
     55         try:
---> 56             result = self.fn(*self.args, **self.kwargs)
     57         except BaseException as exc:
     58             self.future.set_exception(exc)

~/anaconda3/envs/python3/lib/python3.6/site-packages/awswrangler/s3/_read_parquet.py in _read_parquet_row_group(row_group, path, columns, categories, boto3_primitives, s3_additional_kwargs, use_threads)
    371         num_row_groups: int = pq_file.num_row_groups
    372         _logger.debug("Reading Row Group %s/%s [multi-threaded]", row_group + 1, num_row_groups)
--> 373         return pq_file.read_row_group(i=row_group, columns=columns, use_threads=False, use_pandas_metadata=False)
    374 
    375 

~/anaconda3/envs/python3/lib/python3.6/site-packages/pyarrow/parquet.py in read_row_group(self, i, columns, use_threads, use_pandas_metadata)
    269             columns, use_pandas_metadata=use_pandas_metadata)
    270         return self.reader.read_row_group(i, column_indices=column_indices,
--> 271                                           use_threads=use_threads)
    272 
    273     def read_row_groups(self, row_groups, columns=None, use_threads=True,

~/anaconda3/envs/python3/lib/python3.6/site-packages/pyarrow/_parquet.pyx in pyarrow._parquet.ParquetReader.read_row_group()

~/anaconda3/envs/python3/lib/python3.6/site-packages/pyarrow/_parquet.pyx in pyarrow._parquet.ParquetReader.read_row_groups()

~/anaconda3/envs/python3/lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

OSError: Couldn't deserialize thrift: don't know what type: 
Deserializing page header failed.
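
The traceback fails inside awswrangler's multi-threaded row-group path (_read_parquet_row_group), so a sketch of one thing to try, besides pinning awswrangler to 1.8.1 as shown earlier, is a single-threaded read via the use_threads parameter visible in the read_parquet signature above. This sidesteps the concurrent path but is not guaranteed to avoid the underlying page-header error:

import awswrangler as wr

key = 's3://bucket/path/file.parquet'  # placeholder path
# use_threads=False keeps the read on a single thread instead of the
# concurrent row-group reads where the OSError surfaces in the traceback
df = wr.s3.read_parquet(key, safe=False, use_threads=False)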

Top Results From Across the Web

read_parquet() - Error - Couldn't deserialize thrift · Issue #376
Hello, I upgraded to wrangler v1.9.0 and have been receiving errors when trying to read in some AWS DMS files. So far, the...
Dask Dataframe from parquet files: OSError: Couldn't ...
I'm generating a Dask dataframe to be used downstream in a clustering algorithm supplied by dask-ml. In a previous step in my pipeline...
[C++] 'Couldn't deserialize thrift' error when reading large ...
We've run into issues reading Parquet files that contain long binary columns (utf8 strings). In particular, we were generating WKT ...
ReadParquet error - Google Groups
I'm running into an error when I run the imputation step. ... OSError: Couldn't deserialize thrift: TProtocolException: Exceeded size limit.
Parquet errors (from other thread but am blocked from posting ...
parquet as pq pq.read_table(parquet_path) ArrowIOError: Couldn't deserialize thrift: TProtocolException: Invalid data Deserializing page header ...
