
DataFrame.groupby().sum() treating NaN as 0.0


Code Sample, a copy-pastable example if possible

In [62]: import pandas as pd

In [63]: import numpy as np

In [64]: df = pd.DataFrame(data=[['data1', 2, np.nan], ['data2', 3, 4], ['data3', 4, 4]], index=[1, 2, 3], columns=['a', 'b', 'c'])

In [68]: df
Out[68]:
       a  b    c
1  data1  2  NaN
2  data2  3  4.0
3  data3  4  4.0

In [65]: df.groupby(by=['a','b']).sum(skipna=False)
Out[65]:
           c
a     b
data1 2  0.0
data2 3  4.0
data3 4  4.0

Problem description

The NaN value is being treated as 0.0. Is there an option to treat NaN as NaN so that sum() returns NaN?

Expected Output

           c
a     b
data1 2  NaN
data2 3  4.0
data3 4  4.0

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 2.7.14.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-327.36.3.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.22.0
pytest: 3.5.0
pip: 9.0.3
setuptools: 39.0.1
Cython: 0.28.2
numpy: 1.14.2
scipy: 1.0.1
pyarrow: 0.9.0
xarray: 0.10.2
IPython: 5.6.0
sphinx: 1.7.2
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.2
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.2.1
bs4: 4.3.2
html5lib: 0.999
sqlalchemy: 1.2.6
pymysql: None
psycopg2: 2.7.4 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Issue Analytics

Created 5 years ago
Reactions: 12
Comments: 15 (7 by maintainers)

Top GitHub Comments

8 reactions

kitschen commented, Feb 10, 2020

I’m using the latest v1.0.1 but still see this issue. Also, the min_count=1 argument does not seem to work (for timedeltas at least). Any suggestions on how to keep the NaN in a groupby().sum()?

import pandas as pd
from datetime import date, timedelta

# Feb 2 has no duration, so its group sum should stay missing rather than become 0.
data = [
    [date(year=2020, month=2, day=1), timedelta(hours=1, minutes=10), timedelta(hours=2, minutes=10)],
    [date(year=2020, month=2, day=2), None, timedelta(hours=2, minutes=10)],
    [date(year=2020, month=2, day=3), timedelta(hours=1, minutes=10), timedelta(hours=2, minutes=10)],
    [date(year=2020, month=2, day=3), timedelta(hours=1, minutes=10), timedelta(hours=2, minutes=10)],
]

df = pd.DataFrame(data, columns=['date', 'duration', 'total'])
df.set_index(pd.DatetimeIndex(df['date']), inplace=True)

# Group on the DatetimeIndex and request at least one valid value per group.
res = df.groupby(level=0).sum(min_count=1)
print(res)  # use display(res) in a notebook



Expected:
date | duration | total
2020-02-01 | 01:10:00 | 02:10:00
2020-02-02 | nan | 02:10:00
2020-02-03 | 02:20:00 | 04:20:00

But getting:
date | duration | total
2020-02-01 | 01:10:00 | 02:10:00
2020-02-02 | 00:00:00 | 02:10:00
2020-02-03 | 02:20:00 | 04:20:00

------
Found a workaround, namely to use

`res=df.groupby(level=0).apply(lambda x: x.sum(min_count=1))`

instead of

`res=df.groupby(level=0).sum(min_count=1)`
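For anyone who wants to try the workaround in isolation, here is a minimal sketch on a single timedelta column. The column names follow the example above; selecting just `duration` and grouping by a `date` column (rather than by index level) are my own simplifications, not part of the original report. Whether the direct `.sum(min_count=1)` call honors `min_count` for timedeltas may depend on your pandas version, so the sketch only exercises the `apply` route.

```python
import pandas as pd

# Same shape as the report: Feb 2 has no duration, Feb 3 has two rows.
df = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2020-02-01", "2020-02-02", "2020-02-03", "2020-02-03"]
        ),
        "duration": pd.Series(
            [
                pd.Timedelta(hours=1, minutes=10),
                pd.NaT,
                pd.Timedelta(hours=1, minutes=10),
                pd.Timedelta(hours=1, minutes=10),
            ],
            dtype="timedelta64[ns]",
        ),
    }
)

# Workaround: call Series.sum(min_count=1) on each group via apply,
# instead of relying on the groupby-level sum.
per_day = df.groupby("date")["duration"].apply(lambda s: s.sum(min_count=1))

print(per_day)
```

With this route the Feb 2 group, which has no valid value, should come back as missing (NaT) instead of `00:00:00`. Note that `apply` runs a Python function per group, so it will likely be slower than the built-in sum on large data.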

7 reactions

TomAugspurger commented, Apr 25, 2018

I think you want min_count:

In [20]: df.groupby(['a', 'b']).c.sum()
Out[20]:
a      b
data1  2    0.0
data2  3    4.0
data3  4    4.0
Name: c, dtype: float64

In [21]: df.groupby(['a', 'b']).c.sum(min_count=1)
Out[21]:
a      b
data1  2    NaN
data2  3    4.0
data3  4    4.0
Name: c, dtype: float64
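One thing worth keeping in mind: `min_count=1` is not the same as `skipna=False`. It only returns NaN when a group has no valid values at all; a group that mixes NaN with real numbers still has the NaN skipped. The sketch below contrasts the two on a small made-up frame (the `key`/`val` names and the data are illustrative, not from the issue). The per-group `Series.sum(skipna=False)` via `agg` is one possible way to get strict NaN propagation, not necessarily the recommended one.

```python
import numpy as np
import pandas as pd

# Group "x" mixes a number with NaN, "y" is all NaN, "z" has no NaN.
df = pd.DataFrame(
    {"key": ["x", "x", "y", "z"], "val": [1.0, np.nan, np.nan, 3.0]}
)
grouped = df.groupby("key")["val"]

# Default: NaN is skipped, and an all-NaN group sums to 0.0.
default_sum = grouped.sum()               # x: 1.0, y: 0.0, z: 3.0

# min_count=1: a group needs at least one valid value, otherwise NaN.
min_count_sum = grouped.sum(min_count=1)  # x: 1.0, y: NaN, z: 3.0

# Strict propagation: any NaN in a group makes that group's sum NaN.
strict_sum = grouped.agg(lambda s: s.sum(skipna=False))  # x: NaN, y: NaN, z: 3.0

print(pd.DataFrame(
    {"default": default_sum, "min_count=1": min_count_sum, "skipna=False": strict_sum}
))
```

If you only care about groups that are entirely missing, `min_count=1` is enough. If a single NaN should poison the whole group, as in the expected output of the original question, you need something like the last variant.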