BUG: `dropna=False` not respected for groupby aggs on result of concatenated dataframes
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd

df1 = pd.DataFrame(
    {
        "a": [1, 2, 3, 4],
        "b": [1, None, 1, 3],
        "c": [4, 5, 6, 3],
    }
)
df2 = pd.DataFrame(
    {
        "a": [None, None, 7, 8],
        "b": [None, 3, 1, 3],
        "c": [2, 1, 0, 0],
    }
)

# per-dataframe aggregations, keeping NaN group keys
res1 = df1.groupby(["a", "b"], dropna=False).sum()
res2 = df2.groupby(["a", "b"], dropna=False).sum()

# final aggregation on the concatenated partial results
pd.concat([res1, res2]).groupby(["a", "b"], dropna=False).sum()
Issue Description
In some cases (I haven't narrowed this to an exact cause), dropna=False is not respected when doing groupby aggregations on the result of a concat operation. In this example, it comes up specifically when the concatenated dataframes are themselves the results of multi-column groupby aggregations with dropna=False.
For context, this was discovered while debugging https://github.com/dask/dask/issues/8817. Dask's apply-concat-apply model generally depends on applying operations to several dataframes (partitions of one large Dask dataframe), concatenating these results together, and applying a final aggregating operation on the concat result. As a result, this behavior breaks dropna=False support for all multi-column groupby aggregations in Dask.
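The apply-concat-apply pattern described above can be sketched in plain pandas. This is only an illustration, not Dask's implementation: the helper name apply_concat_apply, the sample frame, and the two-way split into partitions are all assumptions made for the example. The group keys here deliberately contain no missing values, so the sketch shows the pattern itself rather than the bug.

```python
import pandas as pd


def apply_concat_apply(partitions, keys):
    """Hypothetical helper: aggregate each partition, concatenate, aggregate again."""
    # "apply": partial aggregation on every partition
    partials = [part.groupby(keys).sum() for part in partitions]
    # "concat": stack the partial results; the group keys now live in a MultiIndex
    combined = pd.concat(partials)
    # final "apply": aggregate the partial results by the index levels
    return combined.groupby(level=keys).sum()


# Illustrative data (not from the issue); no NaN in the group keys.
df = pd.DataFrame(
    {
        "a": [1, 1, 2, 2, 1, 2],
        "b": [1, 2, 1, 1, 1, 2],
        "c": [10, 20, 30, 40, 50, 60],
    }
)
partitions = [df.iloc[:3], df.iloc[3:]]

result = apply_concat_apply(partitions, ["a", "b"])
direct = df.groupby(["a", "b"]).sum()
print(result)
```

The final step groups by MultiIndex levels rather than by columns, which is the code path the issue reports as misbehaving once the keys contain NaN and dropna=False is requested.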
Expected Behavior
# expected
pd.concat([df1, df2]).groupby(["a", "b"], dropna=False).sum()
#          c
# a   b
# 1.0 1.0  4
# 2.0 NaN  5
# 3.0 1.0  6
# 4.0 3.0  3
# 7.0 1.0  0
# 8.0 3.0  0
# NaN 3.0  1
#     NaN  2

# actual
pd.concat([res1, res2]).groupby(["a", "b"], dropna=False).sum()
#          c
# a   b
# 1.0 1.0  1
# 3.0 1.0  1
# 4.0 3.0  1
# 7.0 1.0  1
# 8.0 3.0  1
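One possible way to sidestep the problem, offered here as an untested-by-the-thread assumption rather than a confirmed fix, is to move the group keys back into ordinary columns with reset_index() before the final aggregation, so that the last groupby operates on columns instead of MultiIndex levels. The sketch below reuses df1 and df2 from the reproducible example; the names combined, workaround, and expected are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2, 3, 4], "b": [1, None, 1, 3], "c": [4, 5, 6, 3]})
df2 = pd.DataFrame({"a": [None, None, 7, 8], "b": [None, 3, 1, 3], "c": [2, 1, 0, 0]})

res1 = df1.groupby(["a", "b"], dropna=False).sum()
res2 = df2.groupby(["a", "b"], dropna=False).sum()

# Assumption: grouping on columns avoids the MultiIndex-level code path,
# so the keys are turned back into columns before the final aggregation.
combined = pd.concat([res1.reset_index(), res2.reset_index()], ignore_index=True)
workaround = combined.groupby(["a", "b"], dropna=False).sum()

# Reference result computed directly from the unaggregated frames.
expected = pd.concat([df1, df2]).groupby(["a", "b"], dropna=False).sum()

print(workaround)
print(workaround["c"].tolist() == expected["c"].tolist())
```

This costs an extra reset_index() per partial result, and whether it is appropriate for a library such as Dask is a separate design question that the issue does not settle.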
Installed Versions
INSTALLED VERSIONS
commit : dafa5dd84acc1ba1b2641fd0bb6d3ca3594a5e9e
python : 3.8.13.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-1083-oracle
Version : #91-Ubuntu SMP Mon Oct 25 06:45:22 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.5.0.dev0+682.gdafa5dd84a
numpy : 1.22.3
pytz : 2022.1
dateutil : 2.8.2
pip : 22.0.4
setuptools : 62.1.0
Cython : 0.29.28
pytest : 7.1.1
hypothesis : 6.43.1
sphinx : 4.5.0
blosc : None
feather : None
xlsxwriter : 3.0.3
lxml.etree : 4.8.0
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : 8.2.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : 1.3.4
brotli :
fastparquet : 0.8.0
fsspec : 2021.11.0
gcsfs : 2021.11.0
markupsafe : 2.1.1
matplotlib : 3.5.1
numba : 0.53.1
numexpr : 2.8.0
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : 7.0.0
pyreadstat : 1.1.4
pyxlsb : None
s3fs : 2021.11.0
scipy : 1.8.0
snappy :
sqlalchemy : 1.4.35
tables : 3.7.0
tabulate : 0.8.9
xarray : 0.18.2
xlrd : 2.0.1
xlwt : 1.3.0
zstandard : None
Issue Analytics
- Created a year ago
- Comments: 6 (4 by maintainers)
Top GitHub Comments
Thanks for the report @charlesbluca! This can be reproduced without using concat, e.g.:
This occurs when df has a MultiIndex. The method MultiIndex._get_grouper_for_level does not take dropna into account. For the groupby methods, when dropna is False the codes need to be nonnegative.
Further investigations and PRs to fix are certainly welcome. If this hasn't been resolved, I do plan to take this up at some point in the future, alongside some other improvements with dropna I've been working on.
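The point about nonnegative codes can be illustrated by looking at how a MultiIndex stores missing values. The sketch below only inspects the public MultiIndex.codes and MultiIndex.levels attributes on a hand-built index (the values are illustrative); it does not call the private _get_grouper_for_level method and makes no claim about how a fix should be written.

```python
import numpy as np
import pandas as pd

# Hand-built MultiIndex with a missing value in each level (illustrative data).
mi = pd.MultiIndex.from_arrays(
    [[1.0, 2.0, np.nan], [1.0, np.nan, 3.0]],
    names=["a", "b"],
)

# Each level stores only the non-missing values; positions holding NaN
# are represented by the sentinel code -1.
codes_a, codes_b = (np.asarray(codes) for codes in mi.codes)
print(list(mi.levels[0]), list(codes_a))
print(list(mi.levels[1]), list(codes_b))

# True here: a grouper that reuses these codes unchanged would treat the
# NaN entries as "no group", which matches the rows missing from the actual output.
has_negative = bool((codes_a < 0).any() or (codes_b < 0).any())
print(has_negative)
```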
Please take the example and use it in a test as you see fit! No citation needed.