BUG: reset_index after a groupby raises a ValueError for an empty DataFrame
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the master branch of pandas.
Reproducible Example
import pandas as pd
import datetime as dt
df = pd.DataFrame([(dt.date.today(), "b", 12)], columns=["date", "b", "count"])
df["date"] = pd.to_datetime(df["date"])
# df = df[df["count"] == 1]  # uncomment this line to make the DataFrame empty, so that reset_index raises an exception
df2 = df.set_index('date').groupby(['b']).resample('M').sum().reset_index()
Issue Description
When you run the code above as written, df is not empty and the code runs correctly. But if you uncomment the line that makes the DataFrame empty, reset_index raises ValueError: cannot insert b, already exists.
This is a regression as it was working in pandas 1.2.x
Expected Behavior
reset_index should also work on an empty DataFrame.
Installed Versions
INSTALLED VERSIONS
commit : 73c68257545b5f8530b7044f56647bd2db92e2ba
python : 3.9.7.final.0
python-bits : 64
OS : Darwin
OS-release : 20.6.0
Version : Darwin Kernel Version 20.6.0: Mon Aug 30 06:12:21 PDT 2021; root:xnu-7195.141.6~3/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8
pandas : 1.3.3
numpy : 1.21.2
pytz : 2021.1
dateutil : 2.8.2
pip : 21.1.2
setuptools : 57.0.0
Cython : None
pytest : 6.2.5
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.1 (dt dec pq3 ext lo64)
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.7.1
sqlalchemy : 1.4.25
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None
Issue Analytics
- State:
- Created 2 years ago
- Comments: 5 (4 by maintainers)
I don’t think this is a bug; it is the normal, expected behavior. The reporter has a DataFrame with a single row: today’s date, "b", and 12. When we apply
def function(df): return df.set_index('date').groupby(['b']).resample('M').sum()
column b is dropped from the result because its data is a string, not numeric. In this case, we can apply reset_index and it will return the DataFrame to its original shape.
What if the DataFrame is different, so that b is not dropped? Then, in the result of calling the function, b exists as both a column and an index level, so calling reset_index raises an error.
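The name clash described above can be reproduced without groupby or resample. The following is a minimal sketch, assuming a hand-built DataFrame (the values "x" and "y" are made up for illustration) whose index and one column share the name b:

```python
import pandas as pd

# Illustrative data only: the index and one column are both named "b".
df = pd.DataFrame(
    {"b": ["x", "y"], "count": [1, 2]},
    index=pd.Index(["x", "y"], name="b"),
)

# reset_index tries to insert the index as a new column "b",
# which collides with the existing column of the same name.
try:
    df.reset_index()
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)

# Discarding the index instead of inserting it avoids the clash.
dropped = df.reset_index(drop=True)
print(list(dropped.columns))
```

Passing drop=True is only appropriate when the index duplicates information that is already present in the columns.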
When the function is called on an empty DataFrame, the apply step assumes that all of the columns can be numeric. The expected result therefore contains b as both a column and an index level, and calling reset_index afterwards raises an error. This behavior is better than dropping the empty columns, because dropping them would make the empty DataFrame lose all of its columns when resample is applied. That was the old behavior, which was fixed by #39940; that is why this looks like a regression, but I believe it is not a regression or a bug, and the behavior is consistent.
We could edit reset_index to handle empty DataFrames. In that case, reset_index would ignore the index levels that are already in the columns, move only the missing ones back into the columns, and then delete the remaining duplicated index levels, but only when the DataFrame is empty. However, this would behave inconsistently for a DataFrame that has the same key as both an index level and a column: when empty, it would return a different shape than when it has entries. Thus, I don’t recommend making this edit.
Thanks for the assessment here @ahmedibrhm! In addition, as part of #46560, the silent dropping of column “b” when the DataFrame is non-empty will raise in 2.0. So in pandas 2.0, both the empty and non-empty versions of the OP example will raise. This is an improvement because it removes the value-dependent behavior.
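For readers who still want the "ignore index levels that are already columns" behavior discussed above in their own code, it could be sketched as a user-side helper. The name reset_index_skip_existing is hypothetical and not part of pandas, and the sketch assumes every index level has a name. Unlike the proposal above, it applies to empty and non-empty frames alike, which sidesteps the shape inconsistency raised in the thread:

```python
import pandas as pd


def reset_index_skip_existing(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper (not a pandas API): move index levels into
    columns unless a column of that name already exists, then discard
    whatever index levels remain. Assumes all index levels are named."""
    keep = [name for name in df.index.names if name not in df.columns]
    out = df.reset_index(level=keep) if keep else df
    return out.reset_index(drop=True)


# Illustrative frame: "b" is both an index level and a column.
idx = pd.MultiIndex.from_tuples(
    [("x", pd.Timestamp("2021-09-30"))], names=["b", "date"]
)
frame = pd.DataFrame({"b": ["x"], "count": [1]}, index=idx)

result = reset_index_skip_existing(frame)
print(list(result.columns))

# The same call on an empty slice keeps the same column layout.
empty_result = reset_index_skip_existing(frame.iloc[0:0])
print(list(empty_result.columns))
```

The helper silently prefers the existing column over the index level of the same name, so it should only be used when the two are known to carry the same values.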