BUG: `pd.Grouper` creates empty groups (and in doing so is inconsistent with `groupby`) with `pd.DatetimeIndex`

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd


# note, does not include today
day_offset = [-2, -1, 1, 2]
timestamps = [pd.Timestamp.now() + pd.Timedelta(days=d) for d in day_offset]

df_ts = pd.DataFrame(
    data={"offset": day_offset},
    index=pd.Index(timestamps, name="timestamp")
)

print("\n===============")
print("{}\n".format(df_ts))

# want to group by the day
grouper = pd.Grouper(level="timestamp", freq="D")
grouped = df_ts.groupby(grouper)
print("[Grouper] Groups: {}, Indices: {}\n".format(len(grouped.groups), len(grouped.indices)))

for key, df_group in grouped:
    print("{}: {}".format(key, len(df_group)))
    
    if key.date() == pd.Timestamp.now().date():
        print("...?") # unexpected!

df_date = pd.DataFrame(
    data={"offset": day_offset},
    index=pd.Index(df_ts.index.date, name='date')
)

print("\n===============")
print("{}\n".format(df_date))

grouped = df_date.groupby('date')
print("[Groupby] Groups: {}, Indices: {}\n".format(len(grouped.groups), len(grouped.indices)))

for key, df_group in grouped:
    print("{}: {}".format(key, len(df_group)))

Issue Description

Hello,

In the example above, I am trying to group datetimes by date using pd.Grouper(..., freq="D"). However, this creates a key for a date which doesn’t exist in the data. This behaviour is different to using groupby on dates directly, which does not create the missing key. Note that it also creates a mismatch between the sizes of .groups and .indices, as the latter does not include the empty group.

This creates unexpected behaviour when you try to loop over the groups later (and have more groups than you’d expect).

This is also not particularly easy to handle on the user end (if we do not wish to reconstruct the groupby object). Instead, I have resorted to looping over grouped.indices.keys() and then using get_group to avoid the empty group.
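The workaround described above can be sketched roughly as follows. This is a minimal sketch, not the reporter's actual code: the fixed timestamps are made-up values chosen so that one day (2022-01-03) is missing, and it assumes, as reported in this issue for pandas 1.4.3, that .indices only contains keys that have rows.

```python
import pandas as pd

# Hypothetical fixed timestamps so the run is reproducible;
# 2022-01-03 is deliberately absent from the data.
index = pd.DatetimeIndex(
    ["2022-01-01 09:00", "2022-01-02 10:00", "2022-01-04 11:00", "2022-01-05 12:00"],
    name="timestamp",
)
df = pd.DataFrame({"offset": [-2, -1, 1, 2]}, index=index)

grouped = df.groupby(pd.Grouper(level="timestamp", freq="D"))

# Iterate over the keys of .indices rather than over the groupby object,
# then fetch each group explicitly. Assumption (from the issue report):
# .indices has no entry for the empty 2022-01-03 bin.
non_empty = {key: grouped.get_group(key) for key in grouped.indices}

for key, group in non_empty.items():
    print("{}: {}".format(key.date(), len(group)))
```

A simpler alternative, if iterating over the groupby object directly is acceptable, is to skip any group whose frame is empty inside the loop.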

Expected Behavior

The empty group should not be created by pd.Grouper.
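For comparison, one way to get this behaviour with current pandas is to group on the normalized index instead of using pd.Grouper. This is only a sketch with made-up timestamps, not a fix for the issue: DatetimeIndex.normalize() sets every timestamp to midnight, so groupby only sees the days that actually occur in the data.

```python
import pandas as pd

# Hypothetical fixed timestamps; 2022-01-03 is deliberately absent.
index = pd.DatetimeIndex(
    ["2022-01-01 09:00", "2022-01-02 10:00", "2022-01-04 11:00", "2022-01-05 12:00"],
    name="timestamp",
)
df = pd.DataFrame({"offset": [-2, -1, 1, 2]}, index=index)

# Group on the midnight-normalized timestamps: this is an ordinary
# value-based groupby, so no bin is created for the missing day.
by_day = df.groupby(df.index.normalize())

for key, group in by_day:
    print("{}: {}".format(key.date(), len(group)))
```

The keys here are still Timestamp objects (at midnight) rather than datetime.date objects, which keeps the result index a DatetimeIndex.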

Installed Versions

commit : e8093ba372f9adfe79439d90fe74b0b5b6dea9d6
python : 3.10.4.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19044
machine : AMD64
processor : Intel64 Family 6 Model 165 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United Kingdom.1252

pandas : 1.4.3
numpy : 1.23.1
pytz : 2022.1
dateutil : 2.8.2
setuptools : 61.2.0
pip : 22.1.2
Cython : None
pytest : 6.2.5
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.4.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
markupsafe : 2.1.1
matplotlib : 3.5.2
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 8.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None

Issue Analytics

State:
Created a year ago
Comments: 12 (6 by maintainers)

Top GitHub Comments

naddeoa commented, Nov 20, 2022

I’m running into this too. I think a concrete example might help to set expectations. Take the following data in data.csv (yoinked from a Hacker News data set):

id,by,author,time,time_ts,text,parent,deleted,dead,ranking
9734136,,,1434565400,2015-06-17 18:23:20.000000 UTC,,9733698,true,,0
4921158,,,1355496966,2012-12-14 14:56:06.000000 UTC,,4921100,true,,0

I can group this dataset (that consists of two lines) using the following snippet.

import pandas as pd

df = pd.read_csv('./data.csv')

date_col = 'time_ts'

df[date_col] = pd.to_datetime(df[date_col])
grouped = df.set_index(date_col).groupby(pd.Grouper(freq='D'))

for date_group, dataframe in grouped:
    if len(dataframe) == 0:
        print(f'Empty group for {date_group}')
        continue
    else:
        print(f'Data for {date_group} : {len(dataframe)}')

print(f'result count {len(grouped)}')

There will be 916 results because every day between the two dates that actually exist is included. All but 2 of those 916 results will just be empty data frames. That is definitely not what I expected to happen.
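The count of 916 can be checked without the CSV file. The sketch below rebuilds just the two timestamps from the sample data (the frame layout is an assumption made for illustration) and compares the number of calendar days in the range with the number of daily bins that pd.Grouper produces.

```python
import pandas as pd

# The two time_ts values from the sample data, in ascending order.
index = pd.to_datetime(
    ["2012-12-14 14:56:06", "2015-06-17 18:23:20"], utc=True
).rename("time_ts")
df = pd.DataFrame({"id": [4921158, 9734136]}, index=index)

# Calendar days from the first date to the last, inclusive.
first_day = df.index.min().normalize()
last_day = df.index.max().normalize()
n_days = (last_day - first_day).days + 1

# size() returns one entry per daily bin, including bins with no rows.
sizes = df.groupby(pd.Grouper(freq="D")).size()
n_non_empty = int((sizes > 0).sum())

print(n_days, len(sizes), n_non_empty)
```

Filtering with sizes[sizes > 0] is one way to keep only the days that have data after an aggregation like this.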

I’m not sure what was meant by a “more resampling approach was used”, but the documentation definitely makes it seem like this will just group the data by date and it doesn’t mention creating every possible date as well. But even if it did mention creating every possible date, why would anyone want that? I must be very confused about the purpose of the API if that behavior is desirable. At best I could see it being a no-op for cases where you just use aggregation functions and the math happens to work out such that this wasn’t an issue.

pratyushsharan commented, Aug 8, 2022

@jreback any thoughts if this should be treated as a bug or a documentation issue (enhancement)?
