BUG: convert_dtypes incorrectly converts byte strings to strings in 1.3+

Code Sample, a copy-pastable example

This test will pass on pandas 1.2.5 but fail on 1.3 onwards (checked 1.3, 1.3.2).

import pandas as pd


def test_convert_dtypes_on_byte_string():
    byte_str = b'binary-string'
    dataframe = pd.DataFrame(
        data={
            "data": byte_str,
        },
        index=[0]
    )
    converted_dtypes = dataframe.convert_dtypes()
    assert converted_dtypes['data'][0].decode('ascii') == "binary-string"

Problem description

Error on 1.3 onwards:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 10, in test_convert_dtypes_on_byte_string
AttributeError: 'str' object has no attribute 'decode'

The issue is that convert_dtypes() converts the values in the data column to the string "b'binary-string'", almost as if str(val) had been called on each value. On 1.2.5 the value is correctly left as a byte string.
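The str(val) comparison can be checked in plain Python, without pandas. The sketch below is illustrative only: it assumes the converted value is exactly what str() produces for a bytes object, which the report suggests but does not confirm, and the helper name recover_bytes is hypothetical.

```python
import ast

original = b"binary-string"

# str() on a bytes object yields its repr, including the b'' wrapper.
stringified = str(original)
print(stringified)  # b'binary-string'


def recover_bytes(value):
    """Hypothetical helper: undo a str(bytes) round trip, if that is what happened."""
    if isinstance(value, bytes):
        return value
    # literal_eval only parses Python literals, so it is safe on untrusted text.
    parsed = ast.literal_eval(value)
    if not isinstance(parsed, bytes):
        raise ValueError("value is not the repr of a bytes object")
    return parsed


print(recover_bytes(stringified).decode("ascii"))  # binary-string
```

The stringified value is a str, so it has no decode method, which matches the AttributeError in the traceback.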

Expected Output

Test should pass. Byte strings should not be converted to string.
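Until a fixed release is available, one possible workaround is to put the original bytes columns back after calling convert_dtypes(). This is a minimal sketch, not something proposed in the issue thread: the helper name convert_dtypes_keep_bytes and the extra count column are assumptions made for illustration.

```python
import pandas as pd


def convert_dtypes_keep_bytes(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: run convert_dtypes but leave bytes columns untouched."""
    # Find columns that hold at least one bytes (or bytearray) value.
    bytes_cols = [
        col for col in df.columns
        if df[col].map(lambda v: isinstance(v, (bytes, bytearray))).any()
    ]
    converted = df.convert_dtypes()
    # Replace any converted bytes column with the original, unconverted one.
    for col in bytes_cols:
        converted[col] = df[col]
    return converted


frame = pd.DataFrame({"data": [b"binary-string"], "count": [1]})
result = convert_dtypes_keep_bytes(frame)
print(result["data"].iloc[0].decode("ascii"))  # binary-string
print(result.dtypes)
```

Other columns still go through the normal conversion, so the helper should behave the same on versions with and without the regression.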

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 5f648bf1706dd75a9ca0d29f26eadfbb595fe52b
python : 3.8.11.final.0
python-bits : 64
OS : Darwin
OS-release : 20.6.0
Version : Darwin Kernel Version 20.6.0: Wed Jun 23 00:26:31 PDT 2021; root:xnu-7195.141.2~5/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 1.3.2
numpy : 1.21.2
pytz : 2021.1
dateutil : 2.8.2
pip : 21.1.3
setuptools : 57.4.0
Cython : None
pytest : 6.2.4
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.1 (dt dec pq3 ext lo64)
jinja2 : 3.0.1
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : 2021.07.0
fastparquet : 0.7.0
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : 2021.07.0
scipy : None
sqlalchemy : 1.4.23
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 10 (9 by maintainers)

Top GitHub Comments

MarcoGorelli commented, Aug 23, 2021 (1 reaction)

Thanks @nwilliams-lw - from git bisect it looks like this was caused by https://github.com/pandas-dev/pandas/pull/41622 , cc @jbrockmendel

shubham11941140 commented, Sep 12, 2021 (0 reactions)

I agree with your point @simonjayhawkins, but this is different from the convert_dtypes issue mentioned. I can open another PR, and we can discuss how to solve it there.
