Equality between DataFrames misbehaves if columns contain NaN

Code Sample, a copy-pastable example if possible

In [1]: import numpy as np; import pandas as pd

In [2]: s = pd.DataFrame(-1, index=[1, np.nan, 2,],
   ...:                  columns=[3, np.nan, 1])
   ...: 

In [3]: s + s # good
Out[3]: 
       3.0  NaN    1.0
 1.0    -2    -2    -2
NaN     -2    -2    -2
 2.0    -2    -2    -2

In [4]: s == s # bad
Out[4]: 
       3.0  NaN   1.0
 1.0  True  NaN  True
NaN   True  NaN  True
 2.0  True  NaN  True

Problem description

While it is true that np.nan != np.nan, pandas disregards this for index labels (indeed, s.loc[:, np.nan] works), so == should be consistent with that and treat matching NaN labels as the same label.
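
The mismatch described above can be seen in isolation. The following is a minimal sketch, not taken from the issue itself; the variable names `cols` and `pos` are illustrative, and the behaviour shown is what I would expect from reasonably recent pandas releases.

```python
import numpy as np
import pandas as pd

# IEEE 754 semantics: NaN never compares equal to itself.
print(np.nan == np.nan)  # False

# A pandas Index, by contrast, treats NaN as a label that can be looked up.
cols = pd.Index([3, np.nan, 1])
pos = cols.get_loc(np.nan)  # position of the NaN label within the index
print(pos)

# Index.equals also treats NaN labels in the same position as equal.
print(cols.equals(pd.Index([3, np.nan, 1])))
```

Because label lookup already matches NaN to NaN, the report argues that label alignment during comparison should do the same.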

Expected Output

In [4]: s == s
Out[4]: 
       3.0  NaN   1.0
 1.0  True  True  True
NaN   True  True  True
 2.0  True  True  True
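
Independently of how a given pandas version aligns NaN labels, the expected all-True result can be obtained by comparing positions rather than labels. This is only a sketch of a possible workaround, not something proposed in the issue; `same` and `elementwise` are illustrative names.

```python
import numpy as np
import pandas as pd

s = pd.DataFrame(-1, index=[1, np.nan, 2], columns=[3, np.nan, 1])

# Whole-frame check: DataFrame.equals compares axes and values and
# treats NaNs in the same location as equal.
same = s.equals(s.copy())

# Element-wise check by position: compare the underlying arrays, then
# reattach the original labels so the result is laid out like `s == s`.
elementwise = pd.DataFrame(
    s.to_numpy() == s.to_numpy(),
    index=s.index,
    columns=s.columns,
)
print(same)
print(elementwise)
```

The positional comparison skips label alignment entirely, so it is only appropriate when both frames are already known to share the same layout.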

Output of pd.show_versions()

INSTALLED VERSIONS

commit: b45325e283b16ec8869aaea407de8256fc234f33
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.0-3-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: it_IT.UTF-8
LOCALE: it_IT.UTF-8

pandas: 0.22.0.dev0+201.gb45325e28.dirty
pytest: 3.2.3
pip: 9.0.1
setuptools: 36.7.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.0dev
tables: 3.3.0
numexpr: 2.6.1
feather: 0.3.1
matplotlib: 2.0.0
openpyxl: None
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.6
lxml: None
bs4: 4.5.3
html5lib: 0.999999999
sqlalchemy: 1.0.15
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: 0.2.1

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 17 (17 by maintainers)

Top GitHub Comments

jorisvandenbossche commented, Jan 16, 2018

> By the way, I’m not at all against having equality in your category (1) (dropping the exception I just described), and inserting NaNs… I didn’t propose it just because the change would be bigger and casting bools to objects is sad

I am not sure this was the reason. If comparison operations aligned, you would 1) align, introducing NaNs in the values, and 2) compare, and wherever there are NaNs you would just get False (just as you get now with already aligned objects that contain NaNs). So even if comparisons did alignment, you could still get a normally functioning boolean result.

I think the reasons not to let comparisons align were 1) to make Series behaviour consistent with DataFrame (but of course, we could also have changed the DataFrame behaviour to align as well) and 2) that people liked the error as a sanity check (often, when doing a comparison, you want to use it for boolean indexing, and if you get alignment, that might give unexpected results). One example use case that Wes gave: s1[1:] == s2[:1].
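
The difference between aligning and raising can be reproduced with the `s1[1:] == s2[:1]` example. This is a hedged sketch rather than code from the thread: the Series contents are made up, and `.iloc` is used to make the positional slicing explicit.

```python
import pandas as pd

s1 = pd.Series([10, 20, 30])
s2 = pd.Series([10, 20, 30])

# Positional slices, equivalent to s1[1:] and s2[:1] on a default index.
left, right = s1.iloc[1:], s2.iloc[:1]  # labels [1, 2] versus [0]

# Arithmetic aligns on labels; no label is shared, so every result is NaN.
total = left + right

# The == operator refuses to align differently labeled Series and raises.
try:
    left == right
    raised = False
except ValueError:
    raised = True

# The flex method .eq() opts in to alignment; unmatched labels give False.
aligned = left.eq(right)

print(total)
print(raised)
print(aligned)
```

The exception is the "sanity check" mentioned above: a silently aligned boolean result would have three rows here, which is probably not what someone building a boolean mask intended.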

shoyer commented, Nov 25, 2017

> Interesting, but my understanding is that it does not consider the specific issue of having the same labels but in a different order. I understand the reason not to support comparison between different indexes is to avoid NaNs (or dropping elements/rows). What I suggest instead is just to check if labels are equal after sorting.

I don’t think this is a good idea. Most pandas operations already either (1) align arguments or (2) require identical labels. This would add a third type: (3) require same labels, in any order.
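
For the "same labels, different order" case discussed here, the caller can make the reordering explicit instead of relying on a third kind of operator behaviour. The sketch below is an illustration under that assumption, with made-up data; it is not an API proposed in the thread.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([3, 2, 1], index=["z", "y", "x"])  # same labels, reversed order

# == requires identical labels in identical order, so this raises.
try:
    a == b
    raised = False
except ValueError:
    raised = True

# Reordering b to a's labels first satisfies the identical-labels rule.
result = a == b.reindex(a.index)

print(raised)
print(result)
```

Keeping the `reindex` call in user code leaves the two existing categories intact: operations either align or demand identical labels.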


