BUG: Series.unique segfaults on invalid unicode
Not present anymore. It might have been fixed by accident, but I could not find a PR that did that.
Code Sample, a copy-pastable example
import pandas as pd
ser = pd.Series(["\ud83d"])
ser.unique()
Problem description
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 0: surrogates not allowed
Exception ignored in: 'pandas._libs.tslibs.util.get_c_string_buf_and_size'
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 0: surrogates not allowed
Segmentation fault (core dumped)
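For context on the error message above, here is a minimal standard-library sketch (no pandas involved) of why the string in the sample cannot be encoded. "\ud83d" is a lone high surrogate, which strict UTF-8 refuses to encode. The variable names are purely illustrative.

```python
# "\ud83d" is a lone high surrogate: half of a UTF-16 surrogate pair,
# not a complete character on its own.
lone = "\ud83d"

# Strict UTF-8 encoding rejects it, which is the error shown in the traceback.
reason = None
try:
    lone.encode("utf-8")
except UnicodeEncodeError as exc:
    reason = exc.reason

# The "surrogatepass" error handler encodes the code point anyway.
raw = lone.encode("utf-8", "surrogatepass")

# Paired with its low surrogate it forms a real character (U+1F600);
# a round trip through UTF-16 joins the two halves.
pair = "\ud83d\ude00"
joined = pair.encode("utf-16", "surrogatepass").decode("utf-16")

print(reason)
print(raw)
print(joined == "\U0001F600")
```

Python allows such a string to exist, so any C-level code that assumes every `str` can be converted to UTF-8 has to handle the failure case explicitly.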
Expected Output
Not crashing.
Output of pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.6.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-33-generic
machine : x86_64
processor :
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.0.4
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 19.2.3
setuptools : 41.2.0
Cython : None
pytest : 5.4.1
hypothesis : None
sphinx : 3.0.3
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.15.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.15.1
pytables : None
pytest : 5.4.1
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None
Issue Analytics
- Created 3 years ago
- Comments: 6 (5 by maintainers)
Top GitHub Comments
@jorisvandenbossche @marco-neumann-jdas @phofl I’m happy to do a PR with these tests if nobody is working on this already!
I would like to work on this issue as part of the PyData Amsterdam coding sprint.
So far I have confirmed that the released version of pandas has the bug and that master is fixed. AFAIU, it needs a regression test. The issue was that, for this specific payload of invalid unicode data, calling pandas.Series.unique(..) triggered a segfault. To cover similar issues in the future, I would imagine the test shouldn’t limit itself to the case of calling Series.unique(..). So it should be added in the same place where Series is being tested for different kinds of payloads. Is my thinking correct here?
If so, I’m unclear where I should add the test. I see base/test_unique.py has tests for Series.unique, and arrays/string_/test_string.py has tests for string payloads. Am I looking at the right place? Can someone please guide me? Also, how can I mark the test I add as a regression test?
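As a rough illustration of what a check for this crash could look like, the sketch below runs the reproduction from this issue in a child interpreter, so a segfault shows up as a return code instead of killing the caller. This is an assumption about one possible approach, not how the pandas test suite does it (a test there may simply call Series.unique in-process). The names REPRO, run_in_child and killed_by_signal are hypothetical.

```python
import subprocess
import sys

# Reproduction from this issue, kept as source text. The raw string keeps
# "\ud83d" as an escape sequence so no lone surrogate is passed on the
# command line.
REPRO = r'''
import pandas as pd
ser = pd.Series(["\ud83d"])
print(len(ser.unique()))
'''


def run_in_child(code: str) -> subprocess.CompletedProcess:
    """Run `code` in a fresh interpreter and capture its output."""
    return subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
    )


def killed_by_signal(proc: subprocess.CompletedProcess) -> bool:
    """On POSIX, a negative return code means the child died from a signal
    (for example -11 for SIGSEGV)."""
    return proc.returncode < 0


if __name__ == "__main__":
    proc = run_in_child(REPRO)
    # Only report the outcome here; the result depends on the installed pandas.
    print("return code:", proc.returncode)
    print("killed by signal:", killed_by_signal(proc))
```

With an affected pandas release on Linux, the child would be expected to exit with a negative return code; with a fixed build it should exit with 0.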