OverflowError: value too large to convert to int - fastparquet.cencoding.write_thrift
I'm having an issue storing a large dataset (around 40 GB) in a single parquet file.
I'm using the `fastparquet` library to append `pandas.DataFrame`s to this parquet dataset file, and everything goes fine until the dataset hits 2.3 GB, at which point I get the following errors:
```
OverflowError: value too large to convert to int
Exception ignored in: 'fastparquet.cencoding.write_thrift'
```
Having debugged my way through the `fastparquet` code itself, it seems to me that what is happening internally is that, to append rows, it has to update the page headers, and this is done by creating a Thrift object and writing it to a file:
```python
# fastparquet/writer.py:581
ph = parquet_thrift.PageHeader(type=parquet_thrift.PageType.DATA_PAGE,
                               uncompressed_page_size=l0,
                               compressed_page_size=l1,
                               data_page_header=dph, i32=1)
```
The problem could be that the `uncompressed_page_size` attribute is typed as `int32`, so as the file grows, once it reaches that limit in bytes `fastparquet` begins to throw these errors on write… The fact that this is a Thrift object (where types are rigidly defined) suggests that this typing choice may be an inherent part of the parquet format itself; is this true?
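For reference, a signed 32-bit integer tops out just above 2.1 GB, which is roughly consistent with the ~2.3 GB failure point reported above. A back-of-the-envelope sketch (illustrative only, not `fastparquet` code):

```python
# Thrift i32 fields are signed 32-bit ints, so a size field
# overflows once it would need to exceed 2**31 - 1 bytes.
INT32_MAX = 2**31 - 1            # 2_147_483_647 bytes, ~2.1 GB

page_size = 2_300_000_000        # ~2.3 GB, where the writes started failing
print(page_size > INT32_MAX)     # True: the header field can no longer store it
```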
I'm not sure if I'm looking at a bug in `fastparquet`, or if perhaps this is an intended design choice in the parquet format. I've been unable to get clarity on this anywhere else.
Top GitHub Comments
The argument `row_group_offsets` gives you control over how big the row groups are. The default is geared towards a "tall and narrow" table layout of the sort parquet was designed for. I'm surprised if there are frameworks that wouldn't be able to read this output.
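A minimal sketch of what that suggestion looks like in practice, assuming a placeholder DataFrame `chunk` and target file `data.parquet` (passing `row_group_offsets` as an integer means rows per row group):

```python
import pandas as pd
import fastparquet

# Hypothetical batch of rows to persist; in the scenario above this
# would be each DataFrame being appended to the 40 GB dataset.
chunk = pd.DataFrame({"a": range(1_000_000)})

# Smaller row groups keep every page well under the int32 size limit.
fastparquet.write("data.parquet", chunk, row_group_offsets=500_000)

# Subsequent batches are then appended as new row groups:
fastparquet.write("data.parquet", chunk, append=True,
                  row_group_offsets=500_000)
```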
@martindurant thanks for all the support and prompt communication; closing this issue as it is not an issue with the lib