Crash caused by SIGSEGV in unpack_byte_array
What happened: When reading a parquet file into a pandas dataframe, fastparquet crashes due to an error in unpack_byte_array. I cannot share the parquet file because it contains PII. It was created by AWS DMS.
Dump from parquet-tools - column name is ‘request’:
request: OPTIONAL BINARY L:STRING R:0 D:1
--
request: BINARY UNCOMPRESSED DO:0 FPO:1238 SZ:358286/358286/1.00 VC:10 ENC:RLE,PLAIN ST:[no stats for this column]
Ultimately this ends up with a call to:
read_plain(io_obj.read(), type_=6, count=10, width=None, utf=True, stat=False)
The input is a byte array of length 283180.
What you expected to happen: No crashes on reading these parquet files.
Minimal Complete Verifiable Example:
from fastparquet.encoding import read_plain
from fastparquet.cencoding import NumpyIO

def test_read_plain():
    # raw_bytes = b'a' * 283180  # will also cause a SIGSEGV
    raw_bytes = b'iJwY'
    io_obj = NumpyIO(raw_bytes)
    read_plain(io_obj.read(), type_=6, count=1, width=None, utf=True, stat=False)
Run this and you will get a stack dump:
tests/unit/test_fastparquet.py::test_read_plain Fatal Python error: Segmentation fault
Current thread 0x000000011abb4e00 (most recent call first):
File "/Users/nw/.virtualenvs/bp-lambdas-2/lib/python3.8/site-packages/fastparquet/encoding.py", line 41 in read_plain
File "/Users/nw/dev/src/backup-pipeline/transformer-v2/tests/unit/test_fastparquet.py", line 11 in test_read_plain
<snip>
fish: Job 1, 'pytest --pdb -vv tests/unit/tes…' terminated by signal SIGSEGV (Address boundary error)
Python version: Python 3.8.11
Operating System: Mac / AWS Lambda
Lib Versions: fastparquet==0.7.1, thrift==0.13.0, numpy==1.21.2
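For context on why four arbitrary bytes are enough to trigger this: in Parquet's PLAIN encoding, each BYTE_ARRAY value is stored as a 4-byte little-endian length followed by that many bytes. If the decoder trusts that prefix, b'iJwY' is read as a length of about 1.5 billion bytes, far beyond the end of the 4-byte buffer. The sketch below is a pure-Python illustration of a decoder that validates each length before reading. It is not fastparquet's implementation, and unpack_byte_array_checked is a hypothetical name used only here.

```python
import struct


def unpack_byte_array_checked(raw: bytes, count: int) -> list:
    """Decode `count` PLAIN-encoded BYTE_ARRAY values, validating each length.

    Hypothetical helper for illustration only; not part of fastparquet.
    """
    out = []
    pos = 0
    for i in range(count):
        # Each value starts with a 4-byte little-endian unsigned length.
        if pos + 4 > len(raw):
            raise ValueError(
                f"value {i}: no room for a 4-byte length prefix at offset {pos}"
            )
        (length,) = struct.unpack_from("<I", raw, pos)
        pos += 4
        # Refuse to read past the end of the buffer.
        if pos + length > len(raw):
            raise ValueError(
                f"value {i}: declared length {length} exceeds "
                f"{len(raw) - pos} remaining bytes"
            )
        out.append(raw[pos:pos + length])
        pos += length
    return out


# A well-formed buffer: b'abc' followed by an empty value.
good = struct.pack("<I", 3) + b"abc" + struct.pack("<I", 0)
print(unpack_byte_array_checked(good, 2))  # [b'abc', b'']

# The buffer from the example above: the declared length cannot be satisfied.
try:
    unpack_byte_array_checked(b"iJwY", 1)
except ValueError as exc:
    print("rejected:", exc)
```

With a check like this, malformed input raises a Python exception instead of crashing the interpreter. Whether the DMS-generated file really contains a bad length prefix, or the page is being decoded with the wrong encoding, cannot be determined from the information in this report.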
Issue Analytics
- State:
- Created 2 years ago
- Comments:14 (8 by maintainers)
Top GitHub Comments
(also, your versions of OS, python, thrift and numpy, if you wouldn’t mind)
Thanks for the extra information, it looks like it may be helpful, but will take a little time for me to digest.
Are you using fastparquet 0.7.1?