BUG: Series.unique segfaults on invalid unicode

Not present anymore. It might have been fixed by accident, but I could not find a PR that did that.

Code Sample, a copy-pastable example

import pandas as pd

ser = pd.Series(["\ud83d"])
ser.unique()

Problem description

UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 0: surrogates not allowed
Exception ignored in: 'pandas._libs.tslibs.util.get_c_string_buf_and_size'
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 0: surrogates not allowed
Segmentation fault (core dumped)
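
The UnicodeEncodeError lines in this output can be reproduced without pandas, because "\ud83d" is a lone surrogate: a valid Python str that has no UTF-8 encoding. The sketch below shows only that plain-Python behaviour; the helper name can_encode_utf8 is made up for illustration, and nothing here explains where inside pandas the segfault itself occurred.

```python
# A lone high surrogate: Python accepts it in a str, but it is not
# encodable as UTF-8 under the default (strict) error handler.
LONE_SURROGATE = "\ud83d"


def can_encode_utf8(value: str) -> bool:
    """Return True if value can be encoded as strict UTF-8 (illustrative helper)."""
    try:
        value.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


print(can_encode_utf8(LONE_SURROGATE))  # False: "surrogates not allowed"
print(can_encode_utf8("\U0001F600"))    # True: a complete code point is fine

# The "surrogatepass" error handler encodes the surrogate anyway.
print(LONE_SURROGATE.encode("utf-8", errors="surrogatepass"))  # b'\xed\xa0\xbd'
```

Any C-level code that asks Python for a UTF-8 buffer of such a string has to handle this failure explicitly.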

Expected Output

Not crashing.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : None
python : 3.6.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-33-generic
machine : x86_64
processor :
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.4
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 19.2.3
setuptools : 41.2.0
Cython : None
pytest : 5.4.1
hypothesis : None
sphinx : 3.0.3
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.15.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.15.1
pytables : None
pytest : 5.4.1
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None

Issue Analytics

Created 3 years ago
Comments: 6 (5 by maintainers)

Top GitHub Comments

arw2019 commented, Jun 5, 2020

@jorisvandenbossche @marco-neumann-jdas @phofl I’m happy to do a PR with these tests if nobody is working on this already!

suvayu commented, Jun 16, 2020

I would like to work on this issue as part of the PyData Amsterdam coding sprint.

So far I have confirmed that the released version of Pandas has the bug, and master is fixed. AFAIU, it needs a regression test. The issue was that, for this specific payload of invalid unicode data, calling pandas.Series.unique(..) triggered a segfault. To cover similar issues in the future, I would imagine the test shouldn’t limit itself to the case of calling Series.unique(..).

So it should be added in the same place where Series is tested against different kinds of payloads. Is my thinking correct here? If so, I’m unclear where I should add the test. I see base/test_unique.py has tests for Series.unique, and arrays/string_/test_string.py has tests for string payloads. Am I looking at the right place? Can someone please guide me? Also, how can I mark the test I add as a regression test?
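
One practical difficulty with a regression test for a segfault is that, on an affected build, the crash takes the test runner down with it. A minimal sketch of one way around that is to run the snippet from the report in a child interpreter and check its exit status. This is an assumption about how such a check could be written, not how the pandas test suite does it; run_isolated and check_unique_survives_invalid_unicode are hypothetical names.

```python
import subprocess
import sys
import textwrap

# The reproducer from the report, kept as source text so that it only
# ever executes inside a child interpreter. Requires pandas in the child.
SNIPPET = textwrap.dedent(
    r'''
    import pandas as pd
    ser = pd.Series(["\ud83d"])
    ser.unique()
    '''
)


def run_isolated(code: str, timeout: float = 60.0) -> int:
    """Run code in a fresh interpreter and return its exit status (hypothetical helper)."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        timeout=timeout,
    )
    return proc.returncode


def check_unique_survives_invalid_unicode() -> None:
    # 0 means the child exited normally. On POSIX, a child killed by
    # SIGSEGV reports a negative status instead.
    assert run_isolated(SNIPPET) == 0
```

Renaming the check to start with test_ would let pytest collect it. The subprocess costs some startup time per test, so it is a trade-off that mainly pays off where a hard crash is the failure being guarded against.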
