BUG: reset_index after a groupby raises a ValueError for an empty DataFrame

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

import pandas as pd
import datetime as dt

df = pd.DataFrame([(dt.date.today(), "b", 12)], columns=["date", "b", "count"])
df["date"] = pd.to_datetime(df["date"])
# df = df[df["count"] == 1]  # uncomment this line to make the DataFrame empty, which makes reset_index raise
df2 = df.set_index('date').groupby(['b']).resample('M').sum().reset_index()

Issue Description

When you run the code above as written, df is not empty and the code runs correctly. If you uncomment the line that makes the DataFrame empty, reset_index raises ValueError("cannot insert b, already exists").

This is a regression, as it worked in pandas 1.2.x.

Expected Behavior

reset_index is expected to work on an empty DataFrame as well.

Installed Versions

INSTALLED VERSIONS

commit : 73c68257545b5f8530b7044f56647bd2db92e2ba
python : 3.9.7.final.0
python-bits : 64
OS : Darwin
OS-release : 20.6.0
Version : Darwin Kernel Version 20.6.0: Mon Aug 30 06:12:21 PDT 2021; root:xnu-7195.141.6~3/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8
pandas : 1.3.3
numpy : 1.21.2
pytz : 2021.1
dateutil : 2.8.2
pip : 21.1.2
setuptools : 57.0.0
Cython : None
pytest : 6.2.5
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.1 (dt dec pq3 ext lo64)
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.7.1
sqlalchemy : 1.4.25
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

Top GitHub Comments

ahmedibrhm commented, Jul 13, 2022

I don’t think this is a bug; it is the normal, expected behavior. The reporter has a DataFrame that is equal to

Date       b  count
7/11/2022 "b"  12

When we apply the function def function(df): df.set_index('date').groupby(['b']).resample('M').sum(), the result will be

              count
Date       b
7/11/2022 "b"  12

Notice that column b is dropped because its data is a string, not numeric. In this case, we can call reset_index and it will return the DataFrame to its original shape. But what if the DataFrame looks like the following?

Date       b  count
7/11/2022  13  12

The result of calling the function will be

                b  count
Date       b
7/11/2022  13   13  12

Here b exists as both a column and an index level, so calling reset_index raises an error.
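The name collision described above can be reproduced without groupby or resample. The following is a minimal sketch, assuming a hand-built frame in which "b" is both the index name and a column label; the names frame, renamed, and b_key are illustrative and do not come from the issue.

```python
import pandas as pd

# Illustrative frame: "b" is both the index name and a column label.
frame = pd.DataFrame({"b": [13], "count": [12]}, index=pd.Index([13], name="b"))

try:
    frame.reset_index()
    error_message = None
except ValueError as exc:
    # reset_index refuses to insert a column whose label already exists.
    error_message = str(exc)

print(error_message)

# Renaming the index level first avoids the collision ("b_key" is an arbitrary name).
renamed = frame.rename_axis("b_key").reset_index()
print(list(renamed.columns))
```

Renaming the level before resetting keeps both the index values and the column values, at the cost of one extra column.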

When the DataFrame is empty, apply assumes that all of the columns can be numeric. The result therefore contains b as both a column and an index level, and calling reset_index on it raises an error. This behavior is better than dropping the empty columns, because dropping them would make an empty DataFrame lose all of its columns when resample is applied. That was the old behavior, which #39940 fixed. This is why it looks like a regression, but I believe it is neither a regression nor a bug, and the behavior is consistent.

We could change reset_index to treat empty DataFrames specially. In that case, reset_index would skip the index levels whose names already exist as columns, move only the remaining levels into columns, and then drop the skipped duplicate levels, but only when the DataFrame is empty. However, this would be inconsistent for a DataFrame that has the same key as both an index level and a column: an empty frame would come back with a different shape than a frame with entries. For that reason, I don’t recommend this change.
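For readers who want this behavior in their own code rather than in pandas, the idea can be sketched as a user-side helper. This is only an illustration under stated assumptions: reset_index_skip_existing is a hypothetical function, not a pandas API, and unlike the proposal above it applies the same rule to empty and non-empty frames so that the resulting shape does not depend on the data.

```python
import pandas as pd


def reset_index_skip_existing(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper (not part of pandas): reset the index, discarding
    index levels whose names already exist as column labels."""
    n_levels = df.index.nlevels
    duplicated = [
        name for name in df.index.names
        if name is not None and name in df.columns
    ]
    if duplicated:
        # Drop the colliding levels instead of inserting them as columns.
        df = df.reset_index(level=duplicated, drop=True)
    if len(duplicated) == n_levels:
        # Every level collided, so the frame already has a default index.
        return df
    return df.reset_index()


# Illustrative data: "b" is both an index level and a column.
index = pd.MultiIndex.from_tuples(
    [(13, pd.Timestamp("2022-07-31"))], names=["b", "date"]
)
frame = pd.DataFrame({"b": [13], "count": [12]}, index=index)
empty = frame.iloc[:0]

print(list(reset_index_skip_existing(frame).columns))
print(list(reset_index_skip_existing(empty).columns))
```

Note that the helper discards the index copy of b and keeps the column copy, which is only safe when the two are known to carry the same values.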

rhshadrach commented, Jul 15, 2022

Thanks for the assessment here @ahmedibrhm! In addition, as part of #46560, the non-empty case will raise in 2.0 instead of silently dropping column “b”. So in pandas 2.0, both the empty and non-empty versions of the OP example will raise. This is an improvement because it removes the value-dependent behavior.