
pd.groupby(pd.TimeGrouper()) mishandles null values in dates

See original GitHub issue

Code Sample, a copy-pastable example if possible

The code below was updated following some of the comments.

import pandas as pd
import random
from random import randint

random.seed(2)

# Hand-written rows; three of them share the value 10.
data = [['2010-01-06', randint(1, 9)],
        ['2010-08-26', randint(1, 9)],
        ['2010-09-06', randint(1, 9)],
        ['2010-09-16', 10],
        ['2010-09-20', 10],
        ['2010-09-23', 10],
        ['2010-09-24', randint(1, 9)],
        ['2010-09-20', randint(1, 9)]]

# Random dates in Oct-Dec. The day is drawn from 1-32, so impossible
# dates such as '2010-11-31' are produced deliberately; they become
# NaT after to_datetime(..., errors='coerce') below.
for _ in range(1270):
    data.append(['2010-' + str(randint(10, 12)).zfill(2) + '-' + str(randint(1, 32)).zfill(2),
                 randint(1, 121111)])

df = pd.DataFrame(data, columns=['date', 'n'])
df['date'] = pd.to_datetime(df['date'], errors='coerce')
df_r = df[df['date'].notnull()]  # same frame with the NaT rows dropped

g1 = df.groupby(pd.TimeGrouper(key='date', freq='M'))['n'].nunique()
g2 = df_r.groupby(pd.TimeGrouper(key='date', freq='M'))['n'].nunique()
# NaT keys should be ignored, so the two results should be identical.
# This should print 'True' but it prints 'False'.
print((g1 == g2).mean() == 1)
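For reference, `pd.TimeGrouper` was deprecated in pandas 0.21 and removed in later versions; `pd.Grouper` is the replacement, and on current pandas NaT keys are dropped from the groups consistently, so pre-filtering nulls makes no difference. A minimal sketch of the same comparison (the toy data here is illustrative, not the reporter's):

```python
import pandas as pd

# "2010-02-31" is an impossible date and becomes NaT under coercion.
df = pd.DataFrame({
    "date": pd.to_datetime(["2010-01-06", "2010-02-31", "2010-03-05"],
                           errors="coerce"),
    "n": [1, 2, 3],
})
df_r = df[df["date"].notnull()]  # same frame with the NaT row dropped

# pd.Grouper replaces the removed pd.TimeGrouper; NaT keys are
# excluded from every group, so both results should match.
g1 = df.groupby(pd.Grouper(key="date", freq="M"))["n"].nunique()
g2 = df_r.groupby(pd.Grouper(key="date", freq="M"))["n"].nunique()
print(g1.equals(g2))  # True
```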

Problem description

When a column is used with TimeGrouper for grouping, null values (NaT) are supposed to be ignored. This is indeed the case when the dataset is small. However, the code above demonstrates that with a larger dataset the grouping sometimes distributes null values into legitimate dates. Worst of all, in one case it inserted a value into a row and shifted the entire time series downwards; comparing the two grouped series made one appear to lead the other by one month. This wasted significant resources, as I was developing a financial model based on large datasets.

Update after further investigation: the same piece of code behaves differently across versions, although none of them, including the latest 0.20.3, produces correct results.
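The claimed contract, that null keys are excluded from groups, is easy to see with plain `groupby` on a small frame; the names below are illustrative only:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"key": ["a", np.nan, "a", "b"],
                   "n":   [1, 2, 3, 4]})

# Rows whose key is NaN are dropped from every group by default
# (on pandas >= 1.1 this is controlled by groupby's dropna parameter).
counts = df.groupby("key")["n"].sum()
print(counts)
# key
# a    4
# b    4
```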

Expected Output

True

Output of pd.show_versions()

INSTALLED VERSIONS (this section is also updated)

commit: None python: 2.7.13.final.0 python-bits: 64 OS: Windows OS-release: 7 machine: AMD64 processor: Intel64 Family 6 Model 63 Stepping 0, GenuineIntel byteorder: little LC_ALL: None LANG: None LOCALE: None.None

pandas: 0.20.1 pytest: 3.0.7 pip: 9.0.1 setuptools: 27.2.0 Cython: 0.25.2 numpy: 1.12.1 scipy: 0.19.0 xarray: None IPython: 5.3.0 sphinx: 1.5.6 patsy: 0.4.1 dateutil: 2.6.0 pytz: 2017.2 blosc: None bottleneck: 1.2.1 tables: 3.2.2 numexpr: 2.6.2 feather: None matplotlib: 2.0.2 openpyxl: 2.4.7 xlrd: 1.0.0 xlwt: 1.2.0 xlsxwriter: 0.9.6 lxml: 3.7.3 bs4: 4.6.0 html5lib: 0.999 sqlalchemy: 1.1.9 pymysql: None psycopg2: None jinja2: 2.9.6 s3fs: None pandas_gbq: None pandas_datareader: None

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 7 (6 by maintainers)

Top GitHub Comments

1 reaction
gfyoung commented, Sep 18, 2017

@fujiaxiang: Thanks for reporting this! Unfortunately, we can't replicate your code because df_r is not defined in it. Could you please fix that?

0 reactions
jreback commented, Sep 19, 2017

Yeah, looks like we have an invalid comparison somewhere.

These are the 'same' operations (though they take slightly different implementation paths).

In [9]: df.resample('M', on='date').nunique()
[16:28:10.847 WARNING] /home/jreback/pandas-dev/pandas/core/groupby.py:3167: FutureWarning: In the future, NAT != NAT will be True rather than False.
  inc = np.r_[1, val[1:] != val[:-1]]

Out[9]: 
            date    n
date                 
2010-01-31     1    1
2010-02-28     0    0
2010-03-31     0    0
2010-04-30     0    0
2010-05-31     0    0
...          ...  ...
2010-08-31     5    4
2010-09-30    31  402
2010-10-31    30  397
2010-11-30    31  417
2010-12-31     1    1

[12 rows x 2 columns]

In [10]: df_r.resample('M', on='date').nunique()
Out[10]: 
            date    n
date                 
2010-01-31     1    1
2010-02-28     0    0
2010-03-31     0    0
2010-04-30     0    0
2010-05-31     0    0
...          ...  ...
2010-08-31     1    1
2010-09-30     5    4
2010-10-31    31  402
2010-11-30    30  397
2010-12-31    31  417

[12 rows x 2 columns]

If you can have a look, it would be appreciated.
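Editor's note: the FutureWarning in the output above points at the likely root cause. That `nunique` implementation detects value changes with `val[1:] != val[:-1]`, so the count depends on how NaT compares to itself. Under the NaN-style semantics that NumPy adopted (1.13 and later), NaT is unequal to everything, including itself:

```python
import numpy as np
import pandas as pd

nat = np.datetime64("NaT")

# NaN-style semantics: NaT compares unequal to everything, itself included.
print(nat == nat)        # False
print(nat != nat)        # True

# pandas' own missing-timestamp sentinel behaves the same way.
print(pd.NaT == pd.NaT)  # False
```

With `NaT != NaT` evaluating True, every consecutive NaT in a sorted group looks like a "new" value to the change-detection trick, which would inflate unique counts wherever NaT keys leak into a group.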
