Should DataFrame.merge match NaN with NaN?
Code Sample, a copy-pastable example if possible
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, np.nan]})
df2 = pd.DataFrame({'c': [6, 7, 8, 9], 'd': [4, np.nan, np.nan, 5]})
df1.merge(df2, how='left', left_on='b', right_on='d')
Problem description
df1:
|   | a | b |
| - | - | - |
| 0 | 1 | 4.0 |
| 1 | 2 | 5.0 |
| 2 | 3 | NaN |
df2:
|   | c | d |
| - | - | - |
| 0 | 6 | 4.0 |
| 1 | 7 | NaN |
| 2 | 8 | NaN |
| 3 | 9 | 5.0 |
Current output:
|   | a | b | c | d |
| - | - | - | - | - |
| 0 | 1 | 4.0 | 6 | 4.0 |
| 1 | 2 | 5.0 | 9 | 5.0 |
| 2 | 3 | NaN | 7 | NaN |
| 3 | 3 | NaN | 8 | NaN |
Expected Output
|   | a | b | c | d |
| - | - | - | - | - |
| 0 | 1 | 4.0 | 6 | 4.0 |
| 1 | 2 | 5.0 | 9 | 5.0 |
What’s happening is that the NaN in df1.b is matching the NaNs in df2.d.
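The matching described above can be isolated with a short check. This is only an illustrative sketch: the names `merged` and `nan_matches` are introduced here and are not part of the original report. It contrasts the merge result with ordinary floating-point comparison, where NaN never equals NaN.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, np.nan]})
df2 = pd.DataFrame({"c": [6, 7, 8, 9], "d": [4, np.nan, np.nan, 5]})

merged = df1.merge(df2, how="left", left_on="b", right_on="d")

# Rows where the left key is NaN but a right-hand row was still attached:
# these are the NaN-to-NaN matches the issue is about.
nan_matches = merged[merged["b"].isna() & merged["c"].notna()]
print(nan_matches)

# Under IEEE 754 comparison, NaN is not equal to itself.
print(np.nan == np.nan)
```

So the merge treats NaN keys as equal even though a direct `==` comparison does not.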
I don’t see a situation in which this would be desirable behavior, but if such a situation exists, surely the opposite is also conceivable, and so there should be some documented option in DataFrame.merge which accomplishes this.
What do you think?
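Until such an option exists, one possible workaround is to drop the NaN keys before merging. The helper below, `merge_ignoring_nan_keys`, is a hypothetical name used only for illustration and is not a pandas API. It is a minimal sketch that assumes a single key column per side and a left or inner join (for right or outer joins it would discard right-hand rows).

```python
import numpy as np
import pandas as pd


def merge_ignoring_nan_keys(left, right, left_on, right_on, how="left"):
    """Hypothetical helper: merge after removing right-hand rows whose key is NaN."""
    # With no NaN keys on the right, a NaN key on the left has nothing to match.
    right_clean = right.dropna(subset=[right_on])
    return left.merge(right_clean, how=how, left_on=left_on, right_on=right_on)


df1 = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, np.nan]})
df2 = pd.DataFrame({"c": [6, 7, 8, 9], "d": [4, np.nan, np.nan, 5]})

# Left join: the row with b = NaN is kept, but c and d are NaN for it.
result = merge_ignoring_nan_keys(df1, df2, left_on="b", right_on="d")

# Dropping the NaN-keyed left rows as well reproduces the two-row
# "Expected Output" table from the report.
expected_like = result.dropna(subset=["b"])
print(expected_like)
```

Note that a strict left join keeps the unmatched `a = 3` row with missing `c` and `d`. The two-row table shown under Expected Output only appears once the NaN-keyed left rows are dropped too.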
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: None
pip: 18.0
setuptools: 39.0.1
Cython: None
numpy: 1.15.1
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.5.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
Top GitHub Comments
Agreed, I also would not expect NAs to match here. It might be good to explore a bit whether we have always been doing that, and whether we do this consistently within pandas (in which case we should certainly do some kind of deprecation if we want to change this).
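As a starting point for that exploration, the sketch below compares how a few pandas operations treat NaN. It is an illustrative sample with made-up variable names, not an exhaustive survey, and it reflects default arguments only.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan])

# Elementwise comparison: NaN does not equal NaN.
eq_result = (s == s).tolist()

# Membership test: isin treats NaN as matching NaN.
isin_result = s.isin([np.nan]).tolist()

# groupby: rows with a NaN key are excluded by default.
df = pd.DataFrame({"k": [1.0, np.nan, np.nan], "v": [10, 20, 30]})
group_sizes = df.groupby("k").size()

# merge: NaN keys on both sides are matched with each other.
keys = pd.DataFrame({"k": [np.nan]})
merged_nan = keys.merge(keys, on="k")

print(eq_result, isin_result, len(group_sizes), len(merged_nan))
```

On this small sample, `==` and `groupby` do not treat NaN as an ordinary matching value, while `isin` and `merge` do.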
Closing as duplicate of https://github.com/pandas-dev/pandas/issues/32306 with a more recent discussion on the future policy we want.