BUG: convert_dtypes incorrectly converts byte strings to strings in 1.3+
Code Sample, a copy-pastable example
This test will pass on pandas 1.2.5 but fail on 1.3 onwards (checked 1.3, 1.3.2).
import pandas as pd

# Passes on pandas 1.2.5; raises AttributeError on 1.3 and later.
def test_convert_dtypes_on_byte_string():
    byte_str = b'binary-string'
    dataframe = pd.DataFrame(
        data={
            "data": byte_str,
        },
        index=[0]
    )
    converted_dtypes = dataframe.convert_dtypes()
    assert converted_dtypes['data'][0].decode('ascii') == "binary-string"
Problem description
Error on 1.3 onwards:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 10, in test_convert_dtypes_on_byte_string
AttributeError: 'str' object has no attribute 'decode'
The issue is that convert_dtypes() converts the values in the data column to the string "b'binary-string'", almost as if str(val) had been called on each value. On 1.2.5 the values are correctly left as bytes.
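To illustrate the symptom described above without pandas, here is a minimal sketch in plain Python. It only shows what calling str() on a bytes object produces and why the later .decode() call fails; it is not a claim about how convert_dtypes() is implemented internally. The variable names are illustrative.

```python
# Plain-Python illustration of the reported symptom (no pandas involved).
raw = b"binary-string"

# str() on bytes returns the repr-style text, including the b'' wrapper.
as_text = str(raw)

# decode() is the lossless way to turn bytes into text.
decoded = raw.decode("ascii")

print(as_text)                     # the stringified form, with the b'...' wrapper
print(decoded)                     # the actual text content
print(hasattr(as_text, "decode"))  # str objects have no .decode() method
```

The stringified value is a different object from the original bytes, which is why the assertion in the test raises AttributeError instead of merely comparing unequal.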
Expected Output
Test should pass. Byte strings should not be converted to string.
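Until the behaviour is fixed, one possible workaround is to keep bytes-holding columns out of convert_dtypes() entirely. The sketch below is an assumption-laden illustration, not part of pandas: the helper name convert_dtypes_preserving_bytes is hypothetical, and it assumes that bytes values live in object-dtype columns.

```python
import pandas as pd


def convert_dtypes_preserving_bytes(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: run convert_dtypes() only on non-bytes columns."""
    # Find object-dtype columns that contain at least one bytes-like value.
    bytes_cols = [
        col
        for col in df.columns
        if pd.api.types.is_object_dtype(df[col])
        and df[col].map(lambda v: isinstance(v, (bytes, bytearray))).any()
    ]
    other_cols = [col for col in df.columns if col not in bytes_cols]
    if not other_cols:
        # Nothing to convert; return an untouched copy.
        return df.copy()

    # Convert only the columns that cannot be affected by the bug.
    converted = df[other_cols].convert_dtypes()
    # Re-attach the bytes columns unchanged.
    for col in bytes_cols:
        converted[col] = df[col]
    # Restore the original column order.
    return converted[list(df.columns)]


frame = pd.DataFrame({"data": [b"binary-string"], "n": [1]})
result = convert_dtypes_preserving_bytes(frame)
value = result["data"].iloc[0]
```

This trades completeness for safety: any column that holds bytes is left as object dtype, even if it also contains values that convert_dtypes() could otherwise have handled.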
Output of pd.show_versions()
INSTALLED VERSIONS
commit : 5f648bf1706dd75a9ca0d29f26eadfbb595fe52b
python : 3.8.11.final.0
python-bits : 64
OS : Darwin
OS-release : 20.6.0
Version : Darwin Kernel Version 20.6.0: Wed Jun 23 00:26:31 PDT 2021; root:xnu-7195.141.2~5/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8
pandas : 1.3.2
numpy : 1.21.2
pytz : 2021.1
dateutil : 2.8.2
pip : 21.1.3
setuptools : 57.4.0
Cython : None
pytest : 6.2.4
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.1 (dt dec pq3 ext lo64)
jinja2 : 3.0.1
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : 2021.07.0
fastparquet : 0.7.0
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : 2021.07.0
scipy : None
sqlalchemy : 1.4.23
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None
Issue Analytics
- Created 2 years ago
- Comments: 10 (9 by maintainers)
Thanks @nwilliams-lw - from git bisect it looks like this was caused by https://github.com/pandas-dev/pandas/pull/41622, cc @jbrockmendel
I agree with your point @simonjayhawkins, but this is different from the convert_dtypes issue mentioned. I can open another PR, and we can discuss how to solve it there.