OverflowError: value too large to convert to int - fastparquet.cencoding.write_thrift

I’m having an issue storing a large dataset (around 40GB) in a single parquet file.

I’m using the fastparquet library to append pandas.DataFrames to this parquet dataset file, and everything goes fine until the dataset hits 2.3GB, at which point I get the following errors:

OverflowError: value too large to convert to int
Exception ignored in: 'fastparquet.cencoding.write_thrift'

Having debugged my way through the fastparquet code, it seems that to append rows it has to update the page headers, and that it does this by creating a Thrift object and writing it to the file:

# fastparquet/writer.py:581

ph = parquet_thrift.PageHeader(type=parquet_thrift.PageType.DATA_PAGE,
    uncompressed_page_size=l0,
    compressed_page_size=l1,
    data_page_header=dph, i32=1)

The problem could be that the uncompressed_page_size attribute is typed as int32, so once the size in bytes exceeds what an int32 can hold, fastparquet begins to throw these errors on write. The fact that this is a Thrift object (where types are rigidly defined) suggests that this typing choice may be an inherent part of the parquet format itself; is this true?

I’m not sure whether I’m looking at a bug in fastparquet or an intended design choice in the parquet format. I’ve been unable to get clarity on this anywhere else.
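To make the suspected limit concrete, here is a minimal sketch in plain Python (not fastparquet's own code) of what a signed 32-bit field can hold. The helper name fits_in_i32 is made up for illustration; the standard-library struct module is used only to show that a value one past the limit cannot be packed as a signed 32-bit integer.

```python
import struct

# Largest value a signed 32-bit integer can hold: 2**31 - 1, just under 2 GiB in bytes.
INT32_MAX = 2**31 - 1


def fits_in_i32(n_bytes: int) -> bool:
    """Return True if n_bytes can be stored in a signed 32-bit field.

    Hypothetical helper for illustration; not part of fastparquet.
    """
    try:
        struct.pack("<i", n_bytes)  # "<i" = little-endian signed 32-bit int
        return True
    except struct.error:
        return False


print(INT32_MAX)                   # 2147483647
print(fits_in_i32(INT32_MAX))      # True
print(fits_in_i32(INT32_MAX + 1))  # False
```

If the size being written into an int32 field grows past this value, the field cannot represent it, which would be consistent with the OverflowError shown above.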

Issue Analytics

  • State: closed
  • Created 10 months ago
  • Comments: 8 (4 by maintainers)

Top GitHub Comments

1 reaction
martindurant commented, Nov 25, 2022

The argument row_group_offsets gives you control over how big the row groups are. The default is geared towards a “tall and narrow” table layout of the sort parquet was designed for.

“another non-python language which doesn’t support that structure”

I’d be surprised if there were frameworks that couldn’t read this output.
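As a rough sketch of the row_group_offsets suggestion above: fastparquet.write accepts row_group_offsets either as an int (approximate rows per group) or as a list of start indices. The helper names row_group_starts and write_in_small_groups, the 512 MiB target, and the in-memory bytes-per-row estimate are all assumptions made for illustration, not values recommended in the thread.

```python
def row_group_starts(n_rows, bytes_per_row, max_group_bytes=512 * 1024**2):
    """Return start indices for row groups of at most ~max_group_bytes each.

    Hypothetical helper: sizes are estimates, so keep the target well
    below the int32 limit of 2**31 - 1 bytes.
    """
    if n_rows < 0 or bytes_per_row <= 0 or max_group_bytes <= 0:
        raise ValueError("n_rows must be >= 0; sizes must be positive")
    rows_per_group = max(1, max_group_bytes // bytes_per_row)
    # Always return at least one group start, even for an empty table.
    return list(range(0, n_rows, rows_per_group)) or [0]


def write_in_small_groups(path, df, max_group_bytes=512 * 1024**2):
    """Write df with explicit row-group boundaries (requires fastparquet and pandas)."""
    import fastparquet  # third-party; imported here so the helper above runs without it

    # Estimate bytes per row from the DataFrame's in-memory footprint (an approximation).
    total_bytes = int(df.memory_usage(deep=True).sum())
    bytes_per_row = max(1, total_bytes // max(1, len(df)))
    offsets = row_group_starts(len(df), bytes_per_row, max_group_bytes)
    fastparquet.write(path, df, row_group_offsets=offsets)


print(row_group_starts(10, 1, max_group_bytes=4))  # [0, 4, 8]
```

For a wide table, where each row is large, this yields far fewer rows per group than a default tuned for tall and narrow data, which keeps the size of each written chunk smaller.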

0 reactions
sikanrong commented, Nov 25, 2022

@martindurant thanks for all the support and prompt communication; closing this issue as it is not an issue with the lib
