OverflowError: value too large to convert to int - fastparquet.cencoding.write_thrift
I'm having an issue storing a large dataset (around 40 GB) in a single parquet file.
I'm using the `fastparquet` library to append `pandas.DataFrame`s to this parquet dataset file, and everything goes fine until the dataset hits 2.3 GB, at which point I get the following errors:
```
OverflowError: value too large to convert to int
Exception ignored in: 'fastparquet.cencoding.write_thrift'
```
Having debugged my way through the `fastparquet` code itself, it seems to me that what is happening internally is that, to append rows, it has to update the page headers, and this is done by creating a Thrift object and writing it to a file:
```python
# fastparquet/writer.py:581
ph = parquet_thrift.PageHeader(type=parquet_thrift.PageType.DATA_PAGE,
                               uncompressed_page_size=l0,
                               compressed_page_size=l1,
                               data_page_header=dph, i32=1)
```
The problem could be that the `uncompressed_page_size` attribute is typed as `int32`, so as the file grows, once it reaches that limit in bytes `fastparquet` begins to throw these errors on write… The fact that this is a Thrift object (where types are rigidly defined) suggests that this typing choice may be an inherent part of the parquet format itself; is this true?
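For reference, a signed 32-bit integer tops out just above 2.1 GB, which is roughly consistent with the ~2.3 GB failure point reported above. A back-of-the-envelope sketch (illustrative only, not `fastparquet` code):

```python
# Thrift i32 fields are signed 32-bit ints, so a size field
# overflows once it would need to exceed 2**31 - 1 bytes.
INT32_MAX = 2**31 - 1            # 2_147_483_647 bytes, ~2.1 GB

page_size = 2_300_000_000        # ~2.3 GB, where the writes started failing
print(page_size > INT32_MAX)     # True: the header field can no longer store it
```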
I'm not sure if I'm looking at a bug in `fastparquet`, or if perhaps this is an intended design choice in the parquet format. I've been unable to get clarity on this anywhere else.
Top GitHub Comments
The argument `row_group_offsets` gives you control over how big the row groups are. The default is geared towards a "tall and narrow" table layout of the sort parquet was designed for. I'm surprised if there are frameworks that wouldn't be able to read this output.
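A minimal sketch of what that suggestion looks like in practice, assuming a placeholder DataFrame `chunk` and target file `data.parquet` (passing `row_group_offsets` as an integer means rows per row group):

```python
import pandas as pd
import fastparquet

# Hypothetical batch of rows to persist; in the scenario above this
# would be each DataFrame being appended to the 40 GB dataset.
chunk = pd.DataFrame({"a": range(1_000_000)})

# Smaller row groups keep every page well under the int32 size limit.
fastparquet.write("data.parquet", chunk, row_group_offsets=500_000)

# Subsequent batches are then appended as new row groups:
fastparquet.write("data.parquet", chunk, append=True,
                  row_group_offsets=500_000)
```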
@martindurant thanks for all the support and prompt communication; closing this issue as it is not an issue with the lib