fastparquet v0.6.1 crashes on read


What happened:

Loading a 32 MB Parquet file crashes with a core dump. The process takes up 3 GB of memory before crashing.

What you expected to happen:

Before this version, loading the same file worked with no problems (unsure whether that was on 0.5.0 or 0.6.0). Tested on 0.5.0 without issues; will also test on 0.6.0.

Minimal Complete Verifiable Example:

Sorry I can’t provide the file as it is private data.

import pandas as pd
pd.read_parquet(filename)  # filename: path to the private 32 MB file
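Because the crash is a core dump, it takes down the whole interpreter (or Jupyter kernel). A minimal sketch of one way to reproduce it without losing the session is to run the read in a child process and inspect the exit code. The helper names `run_isolated` and `describe`, and the path `data.parquet` in the usage comment, are hypothetical and not part of the original report; the negative-return-code convention applies to POSIX systems such as the Ubuntu host mentioned here.

```python
import subprocess
import sys


def run_isolated(code, timeout=300):
    """Run a Python snippet in a child interpreter and return its exit code."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode


def describe(returncode):
    """Turn a child exit code into a short human-readable summary."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX, subprocess reports death by signal N as -N.
        return f"killed by signal {-returncode}"
    return f"exited with status {returncode}"


# Usage (hypothetical file path, assumes pandas and fastparquet are installed):
# snippet = "import pandas as pd; pd.read_parquet('data.parquet', engine='fastparquet')"
# print(describe(run_isolated(snippet)))
```

A result such as "killed by signal 11" would point to a segmentation fault in the child, while the parent session stays alive for further testing across versions.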

Anything else we need to know?:

This issue first appeared today (2021-05-12), so we assume it is related to release 0.6.1. Switching to pyarrow fixed the problem immediately. Previously, pyarrow was not installed, so fastparquet was always used. Our Jupyter environments are ephemeral and torn down every day, so the latest package versions are reinstalled daily, which is why we believe 0.6.1 introduced the bug.
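The workaround described above (falling back to pyarrow when the suspect fastparquet release is installed) can be sketched as follows. This is an illustration under stated assumptions, not a fix from the maintainers: the `AFFECTED` set contains only 0.6.1 because that is the only version implicated in this report, and the helper names `installed_version`, `choose_engine`, and `read_parquet_safely` are hypothetical. The `engine` argument of `pandas.read_parquet` accepts `"pyarrow"` or `"fastparquet"`.

```python
from importlib import metadata

# Assumption: only the release implicated in this report is treated as affected.
AFFECTED = {"0.6.1"}


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def choose_engine(fastparquet_version, pyarrow_version):
    """Pick a Parquet engine name, avoiding affected fastparquet releases."""
    if fastparquet_version is not None and fastparquet_version not in AFFECTED:
        return "fastparquet"
    if pyarrow_version is not None:
        return "pyarrow"
    raise RuntimeError("no usable Parquet engine is installed")


def read_parquet_safely(path):
    """Read a Parquet file with an explicitly chosen engine."""
    import pandas as pd  # imported lazily so the helpers above work without pandas

    engine = choose_engine(
        installed_version("fastparquet"), installed_version("pyarrow")
    )
    return pd.read_parquet(path, engine=engine)
```

Passing the engine explicitly avoids relying on whichever library happens to be installed that day, which matters in environments that are rebuilt from unpinned requirements.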

Environment:

  • Dask version: unknown (fastparquet version: 0.6.1)
  • Python version: 3.8.5
  • Operating System: Ubuntu 20.04.1 LTS (Focal Fossa) on AWS
  • Install method (conda, pip, source): pip

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 16 (8 by maintainers)

Top GitHub Comments

martindurant commented, May 13, 2021 (1 reaction)

Note that you helped me to identify a speedup of ~15% for UTF8 string reading, so I’m almost glad this bug was there.

martindurant commented, May 13, 2021 (1 reaction)

OK thank you - I should be able to work with that


