
BUG: Binary operations with empty string arrays produce inconsistent results


Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

>>> import pandas as pd
>>> pd.DataFrame({'a': pd.Series([], dtype='str')}) + 1
Empty DataFrame
Columns: [a]
Index: []
>>> pd.DataFrame({'a': pd.Series([], dtype='str')}) | 1
Empty DataFrame
Columns: [a]
Index: []
>>> pd.DataFrame({'a': pd.Series([None], dtype='str')}) + 1
     a
0  NaN
>>> pd.DataFrame({'a': pd.Series([None], dtype='str')}) | 1
Traceback (most recent call last):
...
TypeError: unsupported operand type(s) for |: 'NoneType' and 'int'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
...
TypeError: Cannot perform 'or_' with a dtyped [object] array and scalar of type [bool]
>>> pd.DataFrame({'a': pd.Series([None, 'a'], dtype='str')}) + 1
Traceback (most recent call last):
...
TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
...
TypeError: can only concatenate str (not "int") to str

Issue Description

When a DataFrame contains a column of dtype object that is either empty or contains only None, pandas may allow certain binary operations that would otherwise raise a TypeError.
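As a side observation (not part of the original report), the empty-array half of this asymmetry is also visible with plain NumPy object arrays, which is what an object-dtype column is backed by: the elementwise operation trivially succeeds when there are no elements to fail on.

>>> import numpy as np
>>> np.array([], dtype=object) + 1     # nothing to operate on, so nothing fails
array([], dtype=object)
>>> np.array(['a'], dtype=object) + 1  # fails elementwise, as expected
Traceback (most recent call last):
...
TypeError: can only concatenate str (not "int") to str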

Expected Behavior

Errors should occur consistently across the different binary operations between string columns and scalars, based on the dtype of the scalar and irrespective of whether the column is empty, contains only None, or contains actual strings.
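As an illustration of what that could look like, here is a minimal pytest-style sketch (not an existing pandas test; the test name is made up). It encodes the expectation that adding an integer to a string column raises a TypeError regardless of the column's contents, so under the current behavior the empty and all-None cases would fail it:

import pandas as pd
import pytest

# Hypothetical test, not part of pandas: the TypeError should not depend on
# whether the string column is empty, all-None, or holds real strings.
@pytest.mark.parametrize("data", [[], [None], [None, "a"], ["a", "b"]])
def test_add_int_to_string_column_is_rejected(data):
    df = pd.DataFrame({"a": pd.Series(data, dtype="str")})
    with pytest.raises(TypeError):
        df + 1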

Installed Versions

pd.show_versions()

INSTALLED VERSIONS

commit            : 66e3805b8cabe977f40c05259cc3fcf7ead5687d
python            : 3.8.12.final.0
python-bits       : 64
OS                : Linux
OS-release        : 4.15.0-76-generic
Version           : #86-Ubuntu SMP Fri Jan 17 17:24:28 UTC 2020
machine           : x86_64
processor         : x86_64
byteorder         : little
LC_ALL            : None
LANG              : None
LOCALE            : en_US.UTF-8

pandas            : 1.3.5
numpy             : 1.21.5
pytz              : 2021.3
dateutil          : 2.8.2
pip               : 22.0.4
setuptools        : 59.8.0
Cython            : 0.29.28
pytest            : 7.0.1
hypothesis        : 6.39.3
sphinx            : 4.4.0
blosc             : None
feather           : None
xlsxwriter        : None
lxml.etree        : 4.8.0
html5lib          : None
pymysql           : None
psycopg2          : None
jinja2            : 3.0.3
IPython           : 8.1.1
pandas_datareader : None
bs4               : 4.10.0
bottleneck        : None
fsspec            : 2022.02.0
fastparquet       : None
gcsfs             : None
matplotlib        : None
numexpr           : None
odfpy             : None
openpyxl          : None
pandas_gbq        : None
pyarrow           : 6.0.1
pyxlsb            : None
s3fs              : None
scipy             : None
sqlalchemy        : None
tables            : None
tabulate          : None
xarray            : None
xlrd              : None
xlwt              : None
numba             : 0.55.1


Top GitHub Comments

2 reactions
vyasr commented, Mar 31, 2022

Yup, that all sounds right to me!

0 reactions
edwhuang23 commented, Mar 28, 2022

Gotcha, so it seems like you’re saying that the behavior of dtypes should not depend on content, which I absolutely agree with. Furthermore, I want to clarify: are you saying that a binary operation between two columns should work when their dtypes are compatible (e.g. an integer column added to a float column, similar to the second commented example with two arrays added together), or that an operation between a column and a scalar value should be allowed if their dtypes are compatible, and that two incompatible dtypes should not work together (e.g. an integer added to a string, as in your reproducible example)? Is this correct?

In terms of proposed changes, I’m thinking of the following:

  1. Changing the current implementation so that it no longer checks whether the column is empty or is nonempty and contains nulls, and instead checks by dtype according to the behavior described above.
  2. Ensuring that valid binary operations behave consistently across dtypes, whether we are dealing with two columns or with one column and a scalar value.

Let me know if I misunderstood anything.
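To make the dtype-based check concrete, the following is a rough, hypothetical sketch (the wrapper and its rule are illustrative, not pandas internals): the decision is made from the dtypes alone, so empty and all-None columns are treated exactly like populated ones.

import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

def add_scalar(column: pd.Series, scalar) -> pd.Series:
    # Hypothetical helper: allow the addition only when the column dtype is
    # numeric and the scalar is a number, without inspecting the values at all.
    if is_numeric_dtype(column.dtype) and isinstance(scalar, (int, float, np.number)):
        return column + scalar
    raise TypeError(
        f"cannot add {type(scalar).__name__} to a column of dtype {column.dtype}"
    )

# With this rule, an empty object-dtype string column is rejected the same way
# as one that actually contains strings:
#   add_scalar(pd.Series([], dtype="str"), 1)   raises TypeError
#   add_scalar(pd.Series([1.0, 2.0]), 1)        returns the summed Series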
