BUG: Series.unique segfaults on invalid unicode

This is not present anymore. It might have been fixed by accident, but I could not find a PR that did that.

Code Sample, a copy-pastable example

import pandas as pd

ser = pd.Series(["\ud83d"])
ser.unique()

Problem description

UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 0: surrogates not allowed
Exception ignored in: 'pandas._libs.tslibs.util.get_c_string_buf_and_size'                                             
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 0: surrogates not allowed
Segmentation fault (core dumped)             
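
For context, the UnicodeEncodeError above can be reproduced without pandas: "\ud83d" is a lone high surrogate, which strict UTF-8 encoding rejects. The sketch below uses only the standard library and says nothing about what pandas does internally; the variable names are illustrative.

```python
# "\ud83d" is a high surrogate with no low-surrogate partner.
lone = "\ud83d"

# Strict UTF-8 encoding refuses lone surrogates.
try:
    lone.encode("utf-8")
    strict_error = None
except UnicodeEncodeError as exc:
    strict_error = exc

# The standard "surrogatepass" error handler encodes the code point anyway.
passed = lone.encode("utf-8", errors="surrogatepass")

print(strict_error)
print(passed)
```

The first print shows the same "surrogates not allowed" message as the traceback. The second shows that the string is representable only when the encoder is told to let surrogates through.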

Expected Output

Not crashing.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.6.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-33-generic
machine : x86_64
processor :
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.4
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 19.2.3
setuptools : 41.2.0
Cython : None
pytest : 5.4.1
hypothesis : None
sphinx : 3.0.3
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.15.0
pandas_datareader : None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.15.1
pytables : None
pytest : 5.4.1
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 6 (5 by maintainers)

Top GitHub Comments

1 reaction
arw2019 commented, Jun 5, 2020

@jorisvandenbossche @marco-neumann-jdas @phofl I’m happy to do a PR with these tests if nobody is working on this already!

0 reactions
suvayu commented, Jun 16, 2020

I would like to work on this issue as part of the PyData Amsterdam coding sprint.

So far I have confirmed that the released version of pandas has the bug and that master is fixed. As far as I understand, it needs a regression test. The issue was that calling pandas.Series.unique(..) on this specific payload of invalid unicode data triggered a segfault. To cover similar issues in the future, I would imagine the test shouldn’t limit itself to the case of calling Series.unique(..).

So it should be added in the same place where Series is tested for different kinds of payloads. Is my thinking correct here? If so, I’m unclear where I should add the test. I see base/test_unique.py has tests for Series.unique, and arrays/string_/test_string.py has tests for string payloads. Am I looking at the right place? Can someone please guide me? Also, how can I mark the test I add as a regression test?
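
A regression test along these lines might look like the sketch below. It is only an illustration, not the test that was eventually merged. The test name is hypothetical, and passing dtype=object is an assumption made here so that the values stay plain Python strings. It assumes a pandas version in which the bug is already fixed; on an affected release it would crash the interpreter instead of failing cleanly.

```python
import numpy as np
import pandas as pd


def test_unique_lone_surrogate():
    # Hypothetical regression test; a comment referencing the GitHub issue
    # would typically go here so the test can be traced back to the report.
    bad = "\ud83d"  # lone high surrogate, the payload from the report

    # dtype=object is an assumption of this sketch, not part of the report.
    ser = pd.Series([bad, bad], dtype=object)

    result = ser.unique()

    # The call should return normally with a single unique value.
    assert isinstance(result, np.ndarray)
    assert list(result) == [bad]
```

Run under pytest, the function is collected like any other test. Parametrizing it over other containers (for example an Index) would address the point about not limiting the check to Series.unique(..).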
