
read_parquet() - Error - Couldn't deserialize thrift

See original GitHub issue

Hello,

I upgraded to wrangler v1.9.0 and have been receiving errors when trying to read in some AWS DMS files. So far, the below behavior has only happened on some (not all) of the full/initial load files, which are normally larger files. I had to reinitialize 5 tables with my DMS task and 3 of the 5 have this problem with the full load files. I tried to read these problematic files in with pandas, as well as, wrangler v1.8.1, and they all worked.

Trying with v1.8.1 works.

!pip install awswrangler==1.8.1
# Restart kernel
import awswrangler as wr
key = 's3://bucket/path/file.parquet'
df = wr.s3.read_parquet(key, safe=False) #Success
cols = wr.s3.read_parquet_metadata(key) #Success

Trying with v1.9.0 fails.

!pip install awswrangler==1.9.0
# Restart kernel
import awswrangler as wr
key = 's3://bucket/path/file.parquet'
df = wr.s3.read_parquet(key, safe=False) #Fails - Errors below
cols = wr.s3.read_parquet_metadata(key) #Success

All of the files can also be read with pandas without issue.

import pandas as pd
key = 's3://bucket/path/file.parquet'
df = pd.read_parquet(key, engine='pyarrow') #Success

Errors I’ve gotten from various files:

  • Couldn’t deserialize thrift: TProtocolException: Invalid data Deserializing page header failed.

  • Couldn’t deserialize thrift: No more data to read. Deserializing page header failed.

  • Couldn’t deserialize thrift: don’t know what type: Deserializing page header failed.

Any ideas? Thanks for your help and time!

Jarret
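
For anyone trying to narrow down where these errors come from, the sketch below (not part of the original issue) reads each row group of the file individually with pyarrow, which can show whether a single row group or the whole file is affected. It assumes s3fs is installed and uses a placeholder key:

import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem()  # uses the default AWS credential chain
key = 's3://bucket/path/file.parquet'  # placeholder - replace with a real object

with fs.open(key, 'rb') as f:
    pf = pq.ParquetFile(f)
    print('row groups:', pf.num_row_groups)
    for i in range(pf.num_row_groups):
        try:
            pf.read_row_group(i)  # read one row group at a time
            print('row group', i, 'OK')
        except OSError as exc:
            print('row group', i, 'failed:', exc)  # e.g. "Couldn't deserialize thrift: ..."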

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 11 (7 by maintainers)

Top GitHub Comments

2 reactions
eefro commented, Mar 20, 2022

I had a similar issue, and reading the parquet file with the fastparquet engine solved it:

df = pd.read_parquet('data.parquet', engine='fastparquet')
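
As a rough usage sketch of that workaround applied to the S3 object from this issue (assuming fastparquet and s3fs are installed; the key is a placeholder), it would look like this:

import pandas as pd

# pandas reads s3:// paths through s3fs; engine='fastparquet' swaps out pyarrow
key = 's3://bucket/path/file.parquet'  # placeholder path
df = pd.read_parquet(key, engine='fastparquet')
print(df.shape)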

1 reaction
jarretg commented, Sep 3, 2020

Hello Igor,

Yes, the files from 3 of the 5 tables range from 79MB to 256MB and are the largest files since I implemented v1.9.0. The files from the other 2 tables, which were read successfully, are under 200KB.

Running from a SageMaker notebook, here is what is returned. Is this sufficient?

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-3-66bc46c2e087> in <module>
----> 1 df_aws = wr.s3.read_parquet(key, safe=False)

~/anaconda3/envs/python3/lib/python3.6/site-packages/awswrangler/s3/_read_parquet.py in read_parquet(path, path_suffix, path_ignore_suffix, partition_filter, columns, validate_schema, chunked, dataset, categories, safe, use_threads, last_modified_begin, last_modified_end, boto3_session, s3_additional_kwargs)
    596         return _read_parquet_chunked(paths=paths, chunked=chunked, validate_schema=validate_schema, **args)
    597     if len(paths) == 1:
--> 598         return _read_parquet(path=paths[0], **args)
    599     if validate_schema is True:
    600         _validate_schemas_from_files(

~/anaconda3/envs/python3/lib/python3.6/site-packages/awswrangler/s3/_read_parquet.py in _read_parquet(path, columns, categories, safe, boto3_session, dataset, path_root, s3_additional_kwargs, use_threads)
    413                     itertools.repeat(_utils.boto3_to_primitives(boto3_session=boto3_session)),
    414                     itertools.repeat(s3_additional_kwargs),
--> 415                     itertools.repeat(use_threads),
    416                 )
    417             )

~/anaconda3/envs/python3/lib/python3.6/concurrent/futures/_base.py in result_iterator()
    584                     # Careful not to keep a reference to the popped future
    585                     if timeout is None:
--> 586                         yield fs.pop().result()
    587                     else:
    588                         yield fs.pop().result(end_time - time.monotonic())

~/anaconda3/envs/python3/lib/python3.6/concurrent/futures/_base.py in result(self, timeout)
    430                 raise CancelledError()
    431             elif self._state == FINISHED:
--> 432                 return self.__get_result()
    433             else:
    434                 raise TimeoutError()

~/anaconda3/envs/python3/lib/python3.6/concurrent/futures/_base.py in __get_result(self)
    382     def __get_result(self):
    383         if self._exception:
--> 384             raise self._exception
    385         else:
    386             return self._result

~/anaconda3/envs/python3/lib/python3.6/concurrent/futures/thread.py in run(self)
     54 
     55         try:
---> 56             result = self.fn(*self.args, **self.kwargs)
     57         except BaseException as exc:
     58             self.future.set_exception(exc)

~/anaconda3/envs/python3/lib/python3.6/site-packages/awswrangler/s3/_read_parquet.py in _read_parquet_row_group(row_group, path, columns, categories, boto3_primitives, s3_additional_kwargs, use_threads)
    371         num_row_groups: int = pq_file.num_row_groups
    372         _logger.debug("Reading Row Group %s/%s [multi-threaded]", row_group + 1, num_row_groups)
--> 373         return pq_file.read_row_group(i=row_group, columns=columns, use_threads=False, use_pandas_metadata=False)
    374 
    375 

~/anaconda3/envs/python3/lib/python3.6/site-packages/pyarrow/parquet.py in read_row_group(self, i, columns, use_threads, use_pandas_metadata)
    269             columns, use_pandas_metadata=use_pandas_metadata)
    270         return self.reader.read_row_group(i, column_indices=column_indices,
--> 271                                           use_threads=use_threads)
    272 
    273     def read_row_groups(self, row_groups, columns=None, use_threads=True,

~/anaconda3/envs/python3/lib/python3.6/site-packages/pyarrow/_parquet.pyx in pyarrow._parquet.ParquetReader.read_row_group()

~/anaconda3/envs/python3/lib/python3.6/site-packages/pyarrow/_parquet.pyx in pyarrow._parquet.ParquetReader.read_row_groups()

~/anaconda3/envs/python3/lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

OSError: Couldn't deserialize thrift: don't know what type: 
Deserializing page header failed.
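
The traceback fails inside awswrangler's multi-threaded row-group path (_read_parquet_row_group), so a sketch of one thing to try, besides pinning awswrangler to 1.8.1 as shown earlier, is a single-threaded read via the use_threads parameter visible in the read_parquet signature above. This sidesteps the concurrent path but is not guaranteed to avoid the underlying page-header error:

import awswrangler as wr

key = 's3://bucket/path/file.parquet'  # placeholder path
# use_threads=False keeps the read on a single thread instead of the
# concurrent row-group reads where the OSError surfaces in the traceback
df = wr.s3.read_parquet(key, safe=False, use_threads=False)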

Top Results From Across the Web

read_parquet() - Error - Couldn't deserialize thrift · Issue #376
Hello, I upgraded to wrangler v1.9.0 and have been receiving errors when trying to read in some AWS DMS files. So far, the...
Dask Dataframe from parquet files: OSError: Couldn't ...
I'm generating a Dask dataframe to be used downstream in a clustering algorithm supplied by dask-ml. In a previous step in my pipeline...
[C++] 'Couldn't deserialize thrift' error when reading large ...
We've run into issues reading Parquet files that contain long binary columns (utf8 strings). In particular, we were generating WKT ...
ReadParquet error - Google Groups
I'm running into an error when I run the imputation step. ... OSError: Couldn't deserialize thrift: TProtocolException: Exceeded size limit.
Parquet errors (from other thread but am blocked from posting ...
parquet as pq pq.read_table(parquet_path) ArrowIOError: Couldn't deserialize thrift: TProtocolException: Invalid data Deserializing page header ...
