BUG: `dropna=False` not respected for groupby aggs on result of concatenated dataframes

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

df1 = pd.DataFrame(
    {
        "a": [1, 2, 3, 4],
        "b": [1, None, 1, 3],
        "c": [4, 5, 6, 3],
    }
)

df2 = pd.DataFrame(
    {
        "a": [None, None, 7, 8],
        "b": [None, 3, 1, 3],
        "c": [2, 1, 0, 0],
    }
)

res1 = df1.groupby(["a", "b"], dropna=False).sum()
res2 = df2.groupby(["a", "b"], dropna=False).sum()

pd.concat([res1, res2]).groupby(["a", "b"], dropna=False).sum()

Issue Description

In some cases (haven’t narrowed this to an exact cause), dropna=False is not respected when doing groupby aggregations on the result of a concat operation. In this example, it is specifically coming up when the dataframes concatenated are the results of multi-column groupby aggregations with dropna=False.

For context, this was discovered through debugging https://github.com/dask/dask/issues/8817; Dask’s apply-concat-apply model generally depends on applying operations to several dataframes (partitions of one large Dask dataframe), concatenating these results together, and applying a final aggregating operation on the concat result. As a result, this behavior is breaking dropna=False support for all multi-column groupby aggregations in Dask.

Expected Behavior

# expected
pd.concat([df1, df2]).groupby(["a", "b"], dropna=False).sum()

#          c
# a   b
# 1.0 1.0  4
# 2.0 NaN  5
# 3.0 1.0  6
# 4.0 3.0  3
# 7.0 1.0  0
# 8.0 3.0  0
# NaN 3.0  1
#     NaN  2

# actual
pd.concat([res1, res2]).groupby(["a", "b"], dropna=False).sum()

#          c
# a   b
# 1.0 1.0  1
# 3.0 1.0  1
# 4.0 3.0  1
# 7.0 1.0  1
# 8.0 3.0  1

Installed Versions

INSTALLED VERSIONS

commit : dafa5dd84acc1ba1b2641fd0bb6d3ca3594a5e9e
python : 3.8.13.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-1083-oracle
Version : #91-Ubuntu SMP Mon Oct 25 06:45:22 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.5.0.dev0+682.gdafa5dd84a
numpy : 1.22.3
pytz : 2022.1
dateutil : 2.8.2
pip : 22.0.4
setuptools : 62.1.0
Cython : 0.29.28
pytest : 7.1.1
hypothesis : 6.43.1
sphinx : 4.5.0
blosc : None
feather : None
xlsxwriter : 3.0.3
lxml.etree : 4.8.0
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : 8.2.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : 1.3.4
brotli :
fastparquet : 0.8.0
fsspec : 2021.11.0
gcsfs : 2021.11.0
markupsafe : 2.1.1
matplotlib : 3.5.1
numba : 0.53.1
numexpr : 2.8.0
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : 7.0.0
pyreadstat : 1.1.4
pyxlsb : None
s3fs : 2021.11.0
scipy : 1.8.0
snappy :
sqlalchemy : 1.4.35
tables : 3.7.0
tabulate : 0.8.9
xarray : 0.18.2
xlrd : 2.0.1
xlwt : 1.3.0
zstandard : None

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

rhshadrach commented, Apr 15, 2022 (4 reactions)

Thanks for the report @charlesbluca! This can be reproduced without using concat, e.g.:

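import numpy as np
import pandas as pd

# Move the grouping keys into a MultiIndex before grouping on them
df = pd.DataFrame(
    {
        'a': [1, np.nan, np.nan],
        'b': [1, 1, np.nan],
        'c': [2, 3, 4],
    }
).set_index(['a', 'b'])
print(df.groupby(["a", "b"], dropna=False).sum())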

         c
a   b     
1.0 1.0  2

This occurs when df has a MultiIndex. The method MultiIndex._get_grouper_for_level does not take dropna into account. For the groupby methods, the codes need to be nonnegative when dropna is False.
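To make the point about codes concrete, the short sketch below builds a MultiIndex from the same key values and inspects how missing entries are stored. It only shows pandas' documented convention of encoding missing values in a MultiIndex level with the code -1; it does not reproduce the internals of _get_grouper_for_level.

```python
import numpy as np
import pandas as pd

# Same key values as the example above, built directly as a MultiIndex
mi = pd.MultiIndex.from_arrays(
    [[1, np.nan, np.nan], [1, 1, np.nan]],
    names=["a", "b"],
)

# NaN is not stored as a level value; missing entries get the code -1
codes_a = [int(code) for code in mi.codes[0]]
codes_b = [int(code) for code in mi.codes[1]]
print("levels:", [list(level) for level in mi.levels])
print("codes for a:", codes_a)
print("codes for b:", codes_b)
```

Negative codes are what a grouper treats as "no group", which is consistent with the NaN rows disappearing from the output above.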

Further investigations and PRs to fix this are certainly welcome. If it hasn't been resolved, I plan to take it up at some point alongside some other dropna improvements I've been working on.

rhshadrach commented, Apr 21, 2022 (1 reaction)
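One possible workaround, sketched here with the data from the original report, is to turn the index levels back into columns with reset_index() before the final aggregation, so the grouping runs on columns rather than on MultiIndex levels. This was not proposed in the thread; it assumes the problem is limited to grouping by MultiIndex levels, as the diagnosis above suggests.

```python
import pandas as pd

df1 = pd.DataFrame(
    {
        "a": [1, 2, 3, 4],
        "b": [1, None, 1, 3],
        "c": [4, 5, 6, 3],
    }
)
df2 = pd.DataFrame(
    {
        "a": [None, None, 7, 8],
        "b": [None, 3, 1, 3],
        "c": [2, 1, 0, 0],
    }
)
keys = ["a", "b"]

# "apply" step: aggregate each partition on its own
partials = [part.groupby(keys, dropna=False).sum() for part in (df1, df2)]

# "concat" step: reset_index() moves the MultiIndex levels back into columns
combined = pd.concat(partials).reset_index()

# final "apply" step: group on columns, where dropna=False is respected
result = combined.groupby(keys, dropna=False).sum()
print(result)
```

The extra reset_index() copies the key values into regular columns, which may cost some memory on large partitions.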

Please take the example and use it in a test as you see fit! No citation needed.
