BUG: Inconsistent behavior between `None` and `pd.NA`
See original GitHub issuePandas version checks
-
I have checked that this issue has not already been reported.
-
I have confirmed this bug exists on the latest version of pandas.
-
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
>>> import pandas as pd
>>> x = pd.Series(['90210', '60018-0123', '10010', 'text', '6.7', "<class 'object'>", '123456'], dtype="string")
>>> invalid = pd.Series([False, False, False, True, True, True, True])
>>> x[invalid] = None
>>> x[6] = pd.NA
>>> x[2] = None
>>> x
0 90210
1 60018-0123
2 <NA>
3 None
4 None
5 None
6 <NA>
dtype: string
Issue Description
There is inconsistent behavior when assigning missing values to a string series. As shown in the reproducible example, using an integer to assign None
will result in a conversion to pd.NA
, whereas using a boolean series to assign None
will result in None
with no conversion.
Expected Behavior
I’d expect there to be a single representation for missing values, so assigning x[invalid] = None
would result in a conversion to pd.NA
.
>>> import pandas as pd
>>> x = pd.Series(['90210', '60018-0123', '10010', 'text', '6.7', "<class 'object'>", '123456'], dtype="string")
>>> invalid = pd.Series([False, False, False, True, True, True, True])
>>> x[invalid] = None
>>> x
0 90210
1 60018-0123
2 10010
3 <NA>
4 <NA>
5 <NA>
6 <NA>
dtype: string
Installed Versions
INSTALLED VERSIONS
commit : e8093ba372f9adfe79439d90fe74b0b5b6dea9d6 python : 3.9.7.final.0 python-bits : 64 OS : Darwin OS-release : 21.3.0 Version : Darwin Kernel Version 21.3.0: Wed Jan 5 21:37:58 PST 2022; root:xnu-8019.80.24~20/RELEASE_X86_64 machine : x86_64 processor : i386 byteorder : little LC_ALL : None LANG : en_US.UTF-8 LOCALE : en_US.UTF-8
pandas : 1.4.3 numpy : 1.22.1 pytz : 2021.3 dateutil : 2.8.2 setuptools : 58.0.4 pip : 22.1 Cython : None pytest : 7.0.1 hypothesis : None sphinx : 4.2.0 blosc : None feather : None xlsxwriter : None lxml.etree : None html5lib : None pymysql : None psycopg2 : None jinja2 : 3.0.3 IPython : 7.31.1 pandas_datareader: None bs4 : 4.10.0 bottleneck : None brotli : None fastparquet : None fsspec : 2022.01.0 gcsfs : None markupsafe : 2.0.1 matplotlib : None numba : None numexpr : None odfpy : None openpyxl : None pandas_gbq : None pyarrow : 6.0.1 pyreadstat : None pyxlsb : None s3fs : None scipy : 1.7.3 snappy : None sqlalchemy : None tables : None tabulate : None xarray : None xlrd : None xlwt : None zstandard : None
Issue Analytics
- State:
- Created a year ago
- Comments:7 (7 by maintainers)
Top GitHub Comments
I opened a PR with a potential fix -> https://github.com/pandas-dev/pandas/pull/47763
Indeed, it is the design goal that the StringArray only uses NA as missing value. And for example upon construction this also happens (conversion from None to NA), like with the scalar setitem:
This is the documented behaviour (see https://pandas.pydata.org/docs/reference/api/pandas.arrays.StringArray.html in addition to the link @phofl provided), and intentional:
https://github.com/pandas-dev/pandas/blob/2eca7e1f384cab7a677678a44ebfc6331d52190d/pandas/core/arrays/string_.py#L309-L323
But so it seems that for the case of
__setitem__
with a mask, we don’t do this validation on the input, and that is certainly a bug.Thanks for the report!