Equality between DataFrames misbehaves if columns contain NaN
Code Sample, a copy-pastable example if possible
In [1]: import numpy as np
   ...: import pandas as pd
   ...:
In [2]: s = pd.DataFrame(-1, index=[1, np.nan, 2],
   ...:                  columns=[3, np.nan, 1])
   ...:
In [3]: s + s  # good
Out[3]:
     3.0  NaN  1.0
1.0   -2   -2   -2
NaN   -2   -2   -2
2.0   -2   -2   -2
In [4]: s == s  # bad
Out[4]:
      3.0  NaN   1.0
1.0  True  NaN  True
NaN  True  NaN  True
2.0  True  NaN  True
Problem description
While it is true that np.nan != np.nan, pandas disregards this in indexes (indeed, s.loc[:, np.nan] works), so it should be coherent.
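The inconsistency described above can be checked directly. The following is a minimal sketch that rebuilds the frame from the report; the variable names `nan_equals_itself`, `nan_position`, and `nan_column` are illustrative only and do not come from the original issue.

```python
import numpy as np
import pandas as pd

# Same frame as in the report: NaN appears both as a row label and as a column label.
s = pd.DataFrame(-1, index=[1, np.nan, 2], columns=[3, np.nan, 1])

# At the value level, NaN never compares equal to itself.
nan_equals_itself = np.nan == np.nan

# As a label, however, NaN can still be located and used for selection.
nan_position = s.columns.get_loc(np.nan)
nan_column = s.loc[:, np.nan]

print(nan_equals_itself)
print(nan_position)
print(nan_column.tolist())
```

So label lookup treats NaN as matching NaN, while the element-wise comparison in the report does not extend that treatment to the NaN column.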
Expected Output
In [4]: s == s
Out[4]:
      3.0   NaN   1.0
1.0  True  True  True
NaN  True  True  True
2.0  True  True  True
Output of pd.show_versions()
INSTALLED VERSIONS
commit: b45325e283b16ec8869aaea407de8256fc234f33
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.0-3-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: it_IT.UTF-8
LOCALE: it_IT.UTF-8
pandas: 0.22.0.dev0+201.gb45325e28.dirty
pytest: 3.2.3
pip: 9.0.1
setuptools: 36.7.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.0dev
tables: 3.3.0
numexpr: 2.6.1
feather: 0.3.1
matplotlib: 2.0.0
openpyxl: None
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.6
lxml: None
bs4: 4.5.3
html5lib: 0.999999999
sqlalchemy: 1.0.15
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: 0.2.1
Issue Analytics
- Created 6 years ago
- Comments: 17 (17 by maintainers)
Top GitHub Comments
I am not sure this was the reason. If comparison operations aligned, you would 1) align, introducing NaNs in the values, and 2) compare, and wherever there are NaNs you would just get False (just as you now get with already aligned objects that contain NaNs). So even if comparisons aligned, you could still get a normally functioning boolean result.
I think some of the reasons for not letting comparisons align were 1) to make Series behaviour consistent with DataFrame (but of course, we could also have changed the DataFrame behaviour to align as well) and 2) that people liked the error as a sanity check (often, when doing a comparison, you want to use it for boolean indexing, and alignment might then give unexpected results). One example use case that Wes gave: s1[1:] == s2[:1].
I don’t think this is a good idea. Most pandas operations already either (1) align arguments or (2) require identical labels. This would add a third type: (3) require same labels, in any order.
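The two existing behaviours mentioned in these comments, aligning versus requiring identical labels, can be seen side by side on a pair of Series. This is only a sketch: the Series `s1` and `s2` and their labels are made up for illustration, and `Series.eq` is used as a stand-in for what an aligning comparison would produce.

```python
import pandas as pd

# Two hypothetical Series whose labels only partly overlap.
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([2, 3, 4], index=["b", "c", "d"])

# (1) Arithmetic aligns on the union of labels; unmatched labels become NaN.
total = s1 + s2

# (2) The == operator requires identical labels and raises otherwise,
# which is the "sanity check" referred to in the comments.
try:
    s1 == s2
    operator_raised = False
except ValueError:
    operator_raised = True

# The flex method .eq() aligns first; positions filled with NaN compare as False.
aligned_eq = s1.eq(s2)

print(total)
print(operator_raised)
print(aligned_eq)
```

With alignment the result is still an ordinary boolean Series, which matches the first comment's point; the error raised by `==` is the stricter behaviour that the second reason defends.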