BUG: `pd.Grouper` creates empty groups (and in doing so is inconsistent with `groupby`) with `pd.DatetimeIndex`
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd

# note, does not include today
day_offset = [-2, -1, 1, 2]
timestamps = [pd.Timestamp.now() + pd.Timedelta(days=d) for d in day_offset]
df_ts = pd.DataFrame(
    data={"offset": day_offset},
    index=pd.Index(timestamps, name="timestamp"),
)
print("\n===============")
print("{}\n".format(df_ts))

# want to group by the day
grouper = pd.Grouper(level="timestamp", freq="D")
grouped = df_ts.groupby(grouper)
print("[Grouper] Groups: {}, Indices: {}\n".format(len(grouped.groups), len(grouped.indices)))
for key, df_group in grouped:
    print("{}: {}".format(key, len(df_group)))
    if key.date() == pd.Timestamp.now().date():
        print("...?")  # unexpected!

# same data, but indexed by plain dates and grouped with groupby directly
df_date = pd.DataFrame(
    data={"offset": day_offset},
    index=pd.Index(df_ts.index.date, name="date"),
)
print("\n===============")
print("{}\n".format(df_date))
grouped = df_date.groupby("date")
print("[Groupby] Groups: {}, Indices: {}\n".format(len(grouped.groups), len(grouped.indices)))
for key, df_group in grouped:
    print("{}: {}".format(key, len(df_group)))
Issue Description
Hello,
In the example above, I am trying to group datetimes by date using `pd.Grouper(..., freq="D")`. However, this creates a key for a date that does not exist in the data. This behaviour differs from using `groupby` on the dates directly (which does not create the missing key). It also creates a mismatch between the sizes of `.groups` and `.indices`, as the latter does not include the empty group.
This causes unexpected behaviour when you later loop over the groups (and get more groups than you’d expect). It is also not particularly easy to handle on the user end (if we do not wish to reconstruct the groupby object). Instead, I have resorted to looping over `grouped.indices.keys()` and then using `get_group` to avoid the empty group.
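The workaround described above can be sketched as follows. This is a minimal illustration, not the reporter's actual code: the fixed timestamps are hypothetical stand-ins for the `pd.Timestamp.now()` offsets in the reproducible example, chosen so that the missing day (2022-08-03) is always the same.

```python
import pandas as pd

# Hypothetical fixed timestamps: two days before and two days after
# 2022-08-03, with no row on 2022-08-03 itself.
timestamps = pd.to_datetime(
    ["2022-08-01 09:00", "2022-08-02 09:00", "2022-08-04 09:00", "2022-08-05 09:00"]
)
df_ts = pd.DataFrame(
    {"offset": [-2, -1, 1, 2]},
    index=pd.DatetimeIndex(timestamps, name="timestamp"),
)

grouped = df_ts.groupby(pd.Grouper(level="timestamp", freq="D"))

# As noted above, .indices does not include the empty bin, so looking up
# groups by its keys skips the day that has no rows.
non_empty = {key: grouped.get_group(key) for key in grouped.indices}

for key, df_group in non_empty.items():
    print("{}: {}".format(key.date(), len(df_group)))
```

Iterating over `grouped` directly and skipping frames where `df_group.empty` is true should give the same result without a second lookup.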
Expected Behavior
The empty group should not be created when using `pd.Grouper`.
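For comparison, here is one way to get only the observed days while keeping a `DatetimeIndex`: group on the index floored to midnight instead of on a frequency-based `pd.Grouper`. This is a sketch under the same hypothetical fixed timestamps as before, and is offered as an illustration of the expected result rather than as an officially recommended fix.

```python
import pandas as pd

# Hypothetical fixed timestamps with a gap on 2022-08-03.
timestamps = pd.to_datetime(
    ["2022-08-01 09:00", "2022-08-02 09:00", "2022-08-04 09:00", "2022-08-05 09:00"]
)
df_ts = pd.DataFrame(
    {"offset": [-2, -1, 1, 2]},
    index=pd.DatetimeIndex(timestamps, name="timestamp"),
)

# normalize() sets every timestamp to midnight; grouping by those values is
# an ordinary groupby on observed keys, so no bin is created for the gap.
by_day = df_ts.groupby(df_ts.index.normalize())
sizes = by_day.size()

for key, df_group in by_day:
    print("{}: {}".format(key.date(), len(df_group)))
```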
Installed Versions
commit : e8093ba372f9adfe79439d90fe74b0b5b6dea9d6
python : 3.10.4.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19044
machine : AMD64
processor : Intel64 Family 6 Model 165 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United Kingdom.1252
pandas : 1.4.3
numpy : 1.23.1
pytz : 2022.1
dateutil : 2.8.2
setuptools : 61.2.0
pip : 22.1.2
Cython : None
pytest : 6.2.5
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.4.0
pandas_datareader : None
bs4 : 4.11.1
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
markupsafe : 2.1.1
matplotlib : 3.5.2
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 8.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
Issue Analytics
- Created a year ago
- Comments:12 (6 by maintainers)
Top GitHub Comments
I’m running into this too. I think a concrete example might help to set expectations. Take the following data in `data.csv` (yoinked from a Hacker News data set).
I can group this dataset (which consists of two lines) using the following snippet.
There will be 916 results because every day between the two days that actually exist is included. All but 2 of those 916 results will just be empty data frames. That is definitely not what I expected to happen.
I’m not sure what was meant by a “more resampling approach was used”, but the documentation definitely makes it seem like this will just group the data by date and it doesn’t mention creating every possible date as well. But even if it did mention creating every possible date, why would anyone want that? I must be very confused about the purpose of the API if that behavior is desirable. At best I could see it being a no-op for cases where you just use aggregation functions and the math happens to work out such that this wasn’t an issue.
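To make the scale of the effect concrete, here is a small sketch with two rows 915 days apart. The commenter's actual `data.csv` and snippet are not shown here, so the column name `value`, the index name `time`, and the dates are all hypothetical; they are chosen only so that the span covers 916 calendar days.

```python
import pandas as pd

# Hypothetical stand-in for the two-row data set (the real data is not shown).
start = pd.Timestamp("2020-01-01 08:00")
df = pd.DataFrame(
    {"value": [1, 2]},
    index=pd.DatetimeIndex([start, start + pd.Timedelta(days=915)], name="time"),
)

# A daily Grouper bins the whole span, so size() reports one entry per
# calendar day between the first and last row, most of them zero.
daily_sizes = df.groupby(pd.Grouper(freq="D")).size()
print(len(daily_sizes))

# Keeping only the bins that contain rows recovers the two observed days.
observed = daily_sizes[daily_sizes > 0]
print(len(observed))
```

For aggregations such as `size()` or `sum()`, the empty bins simply show up as zeros and can be filtered out afterwards; the surprise is mainly when iterating over the groups.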
@jreback any thoughts if this should be treated as a bug or a documentation issue (enhancement)?