BUG: `dropna=False` not respected for groupby aggs on result of concatenated dataframes

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

df1 = pd.DataFrame(
    {
        "a": [1, 2, 3, 4],
        "b": [1, None, 1, 3],
        "c": [4, 5, 6, 3],
    }
)

df2 = pd.DataFrame(
    {
        "a": [None, None, 7, 8],
        "b": [None, 3, 1, 3],
        "c": [2, 1, 0, 0],
    }
)

res1 = df1.groupby(["a", "b"], dropna=False).sum()
res2 = df2.groupby(["a", "b"], dropna=False).sum()

pd.concat([res1, res2]).groupby(["a", "b"], dropna=False).sum()

Issue Description

In some cases (haven’t narrowed this to an exact cause), dropna=False is not respected when doing groupby aggregations on the result of a concat operation. In this example, it is specifically coming up when the dataframes concatenated are the results of multi-column groupby aggregations with dropna=False.

For context, this was discovered through debugging https://github.com/dask/dask/issues/8817; Dask’s apply-concat-apply model generally depends on applying operations to several dataframes (partitions of one large Dask dataframe), concatenating these results together, and applying a final aggregating operation on the concat result. As a result, this behavior is breaking dropna=False support for all multi-column groupby aggregations in Dask.
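The apply-concat-apply pattern described above can be sketched in plain pandas. This is only an illustration, not Dask's actual implementation: the two frames stand in for partitions, and the `reset_index()` step is a possible workaround that assumes grouping on ordinary columns honors `dropna=False` (the names `partials`, `stacked`, and `combined` are made up for this sketch).

```python
import pandas as pd

# Two frames standing in for partitions of one larger dataframe.
df1 = pd.DataFrame({"a": [1, 2, 3, 4], "b": [1, None, 1, 3], "c": [4, 5, 6, 3]})
df2 = pd.DataFrame({"a": [None, None, 7, 8], "b": [None, 3, 1, 3], "c": [2, 1, 0, 0]})

keys = ["a", "b"]

# "Apply": aggregate each partition independently, keeping NaN groups.
partials = [part.groupby(keys, dropna=False).sum() for part in (df1, df2)]

# "Concat": move the group keys back into columns first, so the final
# groupby works on columns rather than on MultiIndex levels (assumed
# workaround, not a confirmed fix).
stacked = pd.concat([p.reset_index() for p in partials], ignore_index=True)

# Final "apply": aggregate the concatenated partial results.
combined = stacked.groupby(keys, dropna=False).sum()
print(combined)
```

If the assumption holds, `combined` keeps the groups whose keys contain NaN, matching a single groupby over the concatenated raw frames.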

Expected Behavior

# expected
pd.concat([df1, df2]).groupby(["a", "b"], dropna=False).sum()

#          c
# a   b
# 1.0 1.0  4
# 2.0 NaN  5
# 3.0 1.0  6
# 4.0 3.0  3
# 7.0 1.0  0
# 8.0 3.0  0
# NaN 3.0  1
#     NaN  2

# actual
pd.concat([res1, res2]).groupby(["a", "b"], dropna=False).sum()

#          c
# a   b
# 1.0 1.0  4
# 3.0 1.0  6
# 4.0 3.0  3
# 7.0 1.0  0
# 8.0 3.0  0

Installed Versions

INSTALLED VERSIONS

commit           : dafa5dd84acc1ba1b2641fd0bb6d3ca3594a5e9e
python           : 3.8.13.final.0
python-bits      : 64
OS               : Linux
OS-release       : 4.15.0-1083-oracle
Version          : #91-Ubuntu SMP Mon Oct 25 06:45:22 UTC 2021
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.5.0.dev0+682.gdafa5dd84a
numpy            : 1.22.3
pytz             : 2022.1
dateutil         : 2.8.2
pip              : 22.0.4
setuptools       : 62.1.0
Cython           : 0.29.28
pytest           : 7.1.1
hypothesis       : 6.43.1
sphinx           : 4.5.0
blosc            : None
feather          : None
xlsxwriter       : 3.0.3
lxml.etree       : 4.8.0
html5lib         : 1.1
pymysql          : None
psycopg2         : None
jinja2           : 3.0.3
IPython          : 8.2.0
pandas_datareader: None
bs4              : 4.11.1
bottleneck       : 1.3.4
brotli           :
fastparquet      : 0.8.0
fsspec           : 2021.11.0
gcsfs            : 2021.11.0
markupsafe       : 2.1.1
matplotlib       : 3.5.1
numba            : 0.53.1
numexpr          : 2.8.0
odfpy            : None
openpyxl         : 3.0.9
pandas_gbq       : None
pyarrow          : 7.0.0
pyreadstat       : 1.1.4
pyxlsb           : None
s3fs             : 2021.11.0
scipy            : 1.8.0
snappy           :
sqlalchemy       : 1.4.35
tables           : 3.7.0
tabulate         : 0.8.9
xarray           : 0.18.2
xlrd             : 2.0.1
xlwt             : 1.3.0
zstandard        : None

Issue Analytics

Created a year ago
Comments: 6 (4 by maintainers)

Top GitHub Comments

4 reactions

rhshadrach commented, Apr 15, 2022

Thanks for the report @charlesbluca! This can be reproduced without using concat, e.g.:

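import numpy as np
import pandas as pd

# Build a frame whose MultiIndex contains NaN in both levels.
df = pd.DataFrame(
    {
        'a': [1, np.nan, np.nan],
        'b': [1, 1, np.nan],
        'c': [2, 3, 4],
    }
).set_index(['a', 'b'])
# Grouping by the index levels drops the NaN groups despite dropna=False.
print(df.groupby(["a", "b"], dropna=False).sum())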

         c
a   b     
1.0 1.0  2

This occurs when df has a MultiIndex. The method MultiIndex._get_grouper_for_level does not take dropna into account. For the groupby methods, when dropna is False the codes need to be nonnegative, but missing values in a MultiIndex are encoded with the code -1.
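To make the point about codes concrete, here is a small sketch that inspects the public `MultiIndex.codes` attribute for the frame from the example. It only shows how missing values are encoded in the index; it does not reproduce the internals of `_get_grouper_for_level`, and the variable name `missing_per_level` is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"a": [1, np.nan, np.nan], "b": [1, 1, np.nan], "c": [2, 3, 4]}
).set_index(["a", "b"])

# One integer code array per index level; -1 marks a missing value.
codes_a, codes_b = df.index.codes
print(codes_a)
print(codes_b)

# Count the rows per level that carry the -1 (missing) code.
missing_per_level = [int((codes == -1).sum()) for codes in df.index.codes]
print(missing_per_level)
```

Rows carrying a -1 code in any level are the ones that disappear from the grouped result shown above.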

Further investigations and PRs to fix this are certainly welcome. If it hasn’t been resolved by then, I plan to take it up at some point, alongside some other improvements to dropna that I’ve been working on.

1 reaction

rhshadrach commented, Apr 21, 2022

Please take the example and use it in a test as you see fit! No citation needed.
