pd.groupby(pd.TimeGrouper()) mishandles null values in dates
Code Sample, a copy-pastable example if possible
The code is updated following some comments
import pandas as pd
import random
from random import randint
random.seed(2)
data= [['2010-01-06', randint(1,9)],
['2010-08-26', randint(1,9)],
['2010-09-06', randint(1,9)],
['2010-09-16', 10],
['2010-09-20', 10],
['2010-09-23', 10],
['2010-09-24', randint(1,9)],
['2010-09-20', randint(1,9)],]
# Day values up to 32 deliberately yield some invalid dates, which become NaT below
for m in range(1270):
    data.append(['2010' + '-' + str(randint(10, 12)).zfill(2) + '-' + str(randint(1, 32)).zfill(2),
                 randint(1, 121111)])
df = pd.DataFrame(data)
df.columns = ['date', 'n']
df['date'] = pd.to_datetime(df['date'], errors='coerce')
df_r = df[df['date'].notnull()]
g1 = df.groupby(pd.TimeGrouper(key='date', freq='M'))['n'].nunique()
g2 = df_r.groupby(pd.TimeGrouper(key='date', freq='M'))['n'].nunique()
# This should print 'True' but it prints 'False'
print((g1==g2).mean() == 1)
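When the check above prints False, it helps to see which bins actually disagree. The following is a minimal sketch and not part of the original report: the helper name `diff_bins` and the sample values are hypothetical, chosen only to show the idea of aligning two grouped Series on the union of their indexes and listing the mismatches.

```python
import pandas as pd


def diff_bins(left, right):
    """Return the rows where two grouped Series disagree.

    The Series are aligned on the union of their indexes, so a bin that
    exists on only one side shows up with NaN on the other side.
    """
    both = pd.concat([left, right], axis=1, keys=["left", "right"])
    both_missing = both["left"].isna() & both["right"].isna()
    mismatch = both["left"].ne(both["right"]) & ~both_missing
    return both[mismatch]


# Hypothetical monthly results standing in for g1 and g2.
a = pd.Series([2, 1, 5],
              index=pd.to_datetime(["2010-01-31", "2010-02-28", "2010-03-31"]))
b = pd.Series([2, 3],
              index=pd.to_datetime(["2010-01-31", "2010-02-28"]))

report = diff_bins(a, b)
print(report)
```

Calling `diff_bins(g1, g2)` on the two results from the sample would, under the same assumptions, show whether the difference is a changed count in one month or a shifted index.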
Problem description
When a column is used in TimeGrouper to group, null values are supposed to be ignored. This is indeed what happens when the dataset is small. However, the code above demonstrates that when the dataset is larger, the grouping sometimes distributes null values into legitimate dates. Worst of all, on one occasion it inserted a value in a row and shifted the entire time series downwards. When I compared the two grouped series, it made me think one was leading the other by one month, which wasted significant resources, as I was developing a financial model based on large datasets.
Updated comments after further investigation: the same piece of code behaves differently on different versions, although none of them, including the latest 0.20.3, produces correct results.
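Until the grouping itself is fixed, one possible workaround is to take the null dates out of the picture before grouping. The sketch below is an assumption-laden illustration, not the maintainers' fix: the helper name `monthly_nunique` and the small data set are hypothetical, and it groups on monthly periods via `Series.dt.to_period` rather than on `TimeGrouper`, so months with no rows are simply absent instead of appearing as empty bins.

```python
import pandas as pd

# Hypothetical data: two valid months plus two values that cannot be parsed.
raw = pd.DataFrame({
    "date": ["2010-01-06", "2010-01-20", "2010-02-30", "2010-02-03", "not a date"],
    "n": [1, 2, 3, 3, 4],
})
raw["date"] = pd.to_datetime(raw["date"], errors="coerce")  # invalid dates become NaT


def monthly_nunique(frame):
    """Count distinct 'n' values per calendar month, ignoring rows with a null date."""
    valid = frame.dropna(subset=["date"])        # drop NaT rows explicitly
    months = valid["date"].dt.to_period("M")     # one monthly period per row
    return valid.groupby(months)["n"].nunique()


with_nulls = monthly_nunique(raw)
without_nulls = monthly_nunique(raw[raw["date"].notnull()])

# Both inputs should give identical results because NaT rows never reach groupby.
print(with_nulls.equals(without_nulls))
```

Dropping the nulls up front makes the result independent of how the grouper treats NaT, at the cost of losing the empty-bin rows that a time-based grouper would normally emit.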
Expected Output
True
Output of pd.show_versions()
INSTALLED VERSIONS this is also updated
commit: None
python: 2.7.13.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 63 Stepping 0, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.20.1
pytest: 3.0.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 5.3.0
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: 4.6.0
html5lib: 0.999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None
Issue Analytics
- Created 6 years ago
- Comments: 7 (6 by maintainers)
Top GitHub Comments
@fujiaxiang: Thanks for reporting this! Unfortunately, we can’t replicate your code because df_r is not defined in your code. Could you please fix that?
Yeah, looks like we have an invalid comparison somewhere.
These are the ‘same’ operations (though with a slightly different implementation path).
If you can have a look, it would be appreciated.