DataFrame.groupby().sum() treating Nan as 0.0

Code Sample, a copy-pastable example if possible

In [62]: import pandas as pd

In [63]: import numpy as np

In [64]: df = pd.DataFrame(data=[['data1', 2, np.nan], ['data2', 3, 4], ['data3', 4, 4]], index=[1, 2, 3], columns=['a', 'b', 'c'])

In [68]: df
Out[68]:
       a  b    c
1  data1  2  NaN
2  data2  3  4.0
3  data3  4  4.0

In [65]: df.groupby(by=['a','b']).sum(skipna=False)
Out[65]:
           c
a     b
data1 2  0.0
data2 3  4.0
data3 4  4.0


Problem description

The NaN value is being treated as 0.0. Is there an option to treat NaN as NaN, so that sum() returns NaN for that group?

Expected Output

           c
a     b
data1 2  NaN
data2 3  4.0
data3 4  4.0

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.14.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-327.36.3.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.22.0
pytest: 3.5.0
pip: 9.0.3
setuptools: 39.0.1
Cython: 0.28.2
numpy: 1.14.2
scipy: 1.0.1
pyarrow: 0.9.0
xarray: 0.10.2
IPython: 5.6.0
sphinx: 1.7.2
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.2
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.2.1
bs4: 4.3.2
html5lib: 0.999
sqlalchemy: 1.2.6
pymysql: None
psycopg2: 2.7.4 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Issue Analytics

  • State: open
  • Created: 5 years ago
  • Reactions: 12
  • Comments: 15 (7 by maintainers)

Top GitHub Comments

kitschen commented, Feb 10, 2020 (8 reactions)

I’m using the latest release, v1.0.1, and still see this issue. The min_count=1 argument also does not seem to work (for timedeltas at least). Any suggestions on how to keep the NaN in a groupby().sum()?

import pandas as pd
from datetime import date, timedelta

# Four rows over three days; the 2020-02-02 row has no duration.
data = [
    [date(year=2020, month=2, day=1), timedelta(hours=1, minutes=10), timedelta(hours=2, minutes=10)],
    [date(year=2020, month=2, day=2), None, timedelta(hours=2, minutes=10)],
    [date(year=2020, month=2, day=3), timedelta(hours=1, minutes=10), timedelta(hours=2, minutes=10)],
    [date(year=2020, month=2, day=3), timedelta(hours=1, minutes=10), timedelta(hours=2, minutes=10)],
]

df = pd.DataFrame(data, columns=['date', 'duration', 'total'])
df.set_index(pd.DatetimeIndex(df['date']), inplace=True)

# Sum per day, asking for at least one non-missing value per group.
res = df.groupby(level=0).sum(min_count=1)
display(res)  # display() is available in IPython/Jupyter; use print(res) elsewhere



Expected:

date       | duration | total
2020-02-01 | 01:10:00 | 02:10:00
2020-02-02 | nan      | 02:10:00
2020-02-03 | 02:20:00 | 04:20:00

But getting:

date       | duration | total
2020-02-01 | 01:10:00 | 02:10:00
2020-02-02 | 00:00:00 | 02:10:00
2020-02-03 | 02:20:00 | 04:20:00

------
Found a workaround, namely to use

`res=df.groupby(level=0).apply(lambda x: x.sum(min_count=1))`

instead of

`res=df.groupby(level=0).sum(min_count=1)`
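
For reference, the workaround above can be sketched on a single timedelta column. This is a minimal illustration, not the commenter's exact code: the frame and the names `day`, `duration`, and `per_day` are made up here, and selecting one column before `apply` is an assumption made to keep non-numeric columns out of the sum. The behaviour may differ between pandas versions, so treat it as something to verify locally.

```python
import pandas as pd

# Hypothetical data: one day has only a missing duration.
df = pd.DataFrame({
    "day": pd.to_datetime(["2020-02-01", "2020-02-02", "2020-02-03", "2020-02-03"]),
    "duration": pd.Series(
        [pd.Timedelta(hours=1, minutes=10), pd.NaT,
         pd.Timedelta(hours=1, minutes=10), pd.Timedelta(hours=1, minutes=10)],
        dtype="timedelta64[ns]",
    ),
})

# Call Series.sum(min_count=1) on each group instead of GroupBy.sum(min_count=1),
# so a group with no valid values yields a missing result rather than 00:00:00.
per_day = df.groupby("day")["duration"].apply(lambda s: s.sum(min_count=1))
print(per_day)
```

Going through `apply` runs a Python function per group, so it is likely slower than the built-in `GroupBy.sum` on large frames.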

TomAugspurger commented, Apr 25, 2018 (7 reactions)

I think you want min_count:

In [20]: df.groupby(['a', 'b']).c.sum()
Out[20]:
a      b
data1  2    0.0
data2  3    4.0
data3  4    4.0
Name: c, dtype: float64

In [21]: df.groupby(['a', 'b']).c.sum(min_count=1)
Out[21]:
a      b
data1  2    NaN
data2  3    4.0
data3  4    4.0
Name: c, dtype: float64
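
Note that min_count=1 only changes the result for groups with no valid values at all; a group that mixes NaN with real numbers still sums the real numbers. The sketch below contrasts the two behaviours on a small made-up frame (the names `key`, `val`, and the three result variables are illustrative and not from the issue). It uses `Series.sum(skipna=False)` through `apply` as one possible way to make any NaN propagate.

```python
import numpy as np
import pandas as pd

# Group "x" mixes a value with NaN, "y" is all NaN, "z" has no NaN.
df = pd.DataFrame({
    "key": ["x", "x", "y", "y", "z"],
    "val": [1.0, np.nan, np.nan, np.nan, 3.0],
})
grouped = df.groupby("key")["val"]

# Default: NaN is skipped, so an all-NaN group sums to 0.0.
default_sum = grouped.sum()               # x: 1.0, y: 0.0, z: 3.0

# min_count=1: a group needs at least one valid value, else NaN.
min_count_sum = grouped.sum(min_count=1)  # x: 1.0, y: NaN, z: 3.0

# skipna=False per group: any NaN in the group makes the sum NaN.
strict_sum = grouped.apply(lambda s: s.sum(skipna=False))  # x: NaN, y: NaN, z: 3.0

print(pd.DataFrame({
    "default": default_sum,
    "min_count=1": min_count_sum,
    "skipna=False": strict_sum,
}))
```

Which variant is appropriate depends on whether a partially missing group should count as missing.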
