df.groupby('index_column_name') raises a KeyError in Modin but works in pandas

System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): I use arch btw
Modin version (modin.__version__): 0.8.0
Python version: 3.8
Code we can use to reproduce:

Pandas (works fine):

import pandas as pd
from pandas import util
df = util.testing.makeMixedDataFrame()
print(df.head())
df = df.to_numpy()
df = pd.DataFrame(df)
df.columns = ["A", "B", "C", "D"]
df = df.set_index("C")
df.groupby("C")

Modin (gives error):

import modin.pandas as pd
from pandas import util
df = util.testing.makeMixedDataFrame()
print(df.head())
df = df.to_numpy()
df = pd.DataFrame(df)
df.columns = ["A", "B", "C", "D"]
df = df.set_index("C")
df.groupby("C") #  <- Key Error! (scroll below)

Describe the problem

In pandas, calling groupby with the name of the index column works and does not raise. In Modin, the same call raises a KeyError.
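For reference, pandas accepts an index level name as a groupby key and treats it like `level=`. The sketch below uses a small hand-built frame (the values are illustrative, not the `makeMixedDataFrame` output) and shows two explicit spellings that do not depend on that fallback. Whether either one avoids the error in Modin 0.8.0 is an assumption; nothing in this report confirms it.

```python
import pandas as pd

# Hand-built stand-in for the frame in the report (values are illustrative).
df = pd.DataFrame(
    {
        "A": [0.0, 1.0, 2.0, 3.0, 4.0],
        "B": [0.0, 1.0, 0.0, 1.0, 0.0],
        "C": ["foo1", "foo2", "foo3", "foo4", "foo5"],
    }
).set_index("C")

# pandas resolves "C" against the index level names when no column matches.
by_name = df.groupby("C").sum()

# Explicit alternatives that do not rely on that fallback:
# group by the index level directly ...
by_level = df.groupby(level="C").sum()
# ... or turn the index back into a regular column first.
by_column = df.reset_index().groupby("C").sum()

print(by_name.equals(by_level), by_name.equals(by_column))
```

In pandas all three results are the same frame, so either explicit form can stand in for `df.groupby("C")` without changing the output.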

Source code / logs

KeyError                                  Traceback (most recent call last)
 in 
      7 df.columns = ["A", "B", "C", "D"]
      8 df = df.set_index("C")
----> 9 df.groupby("C")

~/anaconda3/envs/recnn/lib/python3.8/site-packages/modin/pandas/dataframe.py in groupby(self, by, axis, level, as_index, sort, group_keys, squeeze, observed)
    434                 pass
    435             else:
--> 436                 by = self.__getitem__(by)._query_compiler
    437         elif isinstance(by, Series):
    438             drop = by._parent is self

~/anaconda3/envs/recnn/lib/python3.8/site-packages/modin/pandas/base.py in __getitem__(self, key)
   3458             return self._getitem_slice(indexer)
   3459         else:
-> 3460             return self._getitem(key)
   3461 
   3462     def _getitem_slice(self, key):

~/anaconda3/envs/recnn/lib/python3.8/site-packages/modin/pandas/dataframe.py in _getitem(self, key)
   2422             # return self._getitem_multilevel(key)
   2423         else:
-> 2424             return self._getitem_column(key)
   2425 
   2426     def _getitem_column(self, key):

~/anaconda3/envs/recnn/lib/python3.8/site-packages/modin/pandas/dataframe.py in _getitem_column(self, key)
   2426     def _getitem_column(self, key):
   2427         if key not in self.keys():
-> 2428             raise KeyError("{}".format(key))
   2429         s = DataFrame(
   2430             query_compiler=self._query_compiler.getitem_column_array([key])

KeyError: 'C'

Issue Analytics

Created 3 years ago
Reactions: 1
Comments: 10 (6 by maintainers)

Top GitHub Comments

1 reaction

YarShev commented, Sep 23, 2020

I reopened the issue because the fix doesn’t fully resolve the problem.

import modin.pandas as pd
import pandas
from pandas import util
df = util.testing.makeMixedDataFrame()
df = df.to_numpy()
pdf = pandas.DataFrame(df)
mdf = pd.DataFrame(df)
pdf.columns = ["A", "B", "C", "D"]
mdf.columns = ["A", "B", "C", "D"]
pdf = pdf.set_index("C")
mdf = mdf.set_index("C")
# data frames are equal so far
pdf
      A  B          D
C
foo1  0  0 2009-01-01
foo2  1  1 2009-01-02
foo3  2  0 2009-01-05
foo4  3  1 2009-01-06
foo5  4  0 2009-01-07

mdf
      A  B          D
C
foo1  0  0 2009-01-01
foo2  1  1 2009-01-02
foo3  2  0 2009-01-05
foo4  3  1 2009-01-06
foo5  4  0 2009-01-07

# However, the number of groups is not equal
pdf.groupby("C").groups
{'foo1': ['foo1'], 'foo2': ['foo2'], 'foo3': ['foo3'], 'foo4': ['foo4'], 'foo5': ['foo5']}

mdf.groupby("C").groups
{'C': ['foo1']}

# `sum` operation is performed incorrectly
pdf.groupby("C").sum()
        A    B
C
foo1  0.0  0.0
foo2  1.0  1.0
foo3  2.0  0.0
foo4  3.0  1.0
foo5  4.0  0.0

mdf.groupby("C").sum()
Empty DataFrame
Columns: []
Index: []
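The comparison above can be turned into a small repeatable check. This is a sketch only: `groupby_index_name_summary` is a hypothetical helper name, the frame is hand-built rather than produced by `makeMixedDataFrame`, and the helper assumes the module passed in exposes the pandas `DataFrame`, `set_index` and `groupby` API.

```python
import pandas


def groupby_index_name_summary(pd_module):
    """Group by the index name and report (number of groups, shape of sum())."""
    df = pd_module.DataFrame(
        {
            "A": [0, 1, 2, 3, 4],
            "B": [0, 1, 0, 1, 0],
            "C": ["foo1", "foo2", "foo3", "foo4", "foo5"],
        }
    ).set_index("C")
    grouped = df.groupby("C")
    return len(grouped.groups), tuple(grouped.sum().shape)


# pandas is the reference: one group per index label, one output row per group.
reference = groupby_index_name_summary(pandas)
print(reference)
```

To compare, pass `modin.pandas` as `pd_module` and check that the result equals `reference`. With the behaviour described in this comment, the two would differ.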

0 reactions

yangyxt commented, Dec 6, 2022

I ran into this today, and it looks like the issue has not been completely resolved.

Using modin 0.15.1.

The data frame is basically: df = mpd.DataFrame({"A": [1, 2, 3], "B": [1, 4, 5]})

Running df.groupby("A").apply(lambda x: x.loc[:, "B"]) gives a KeyError on "B".
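For comparison, here is the same call in plain pandas, where it returns the "B" values for each group. The second form selects the column on the groupby object before calling `apply`, so the function never has to look up "B" itself. Treat that as a possible workaround only; it is an assumption that it avoids the KeyError in Modin 0.15.1, and it has not been verified here.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [1, 4, 5]})

# The call from the comment, run against pandas as the reference behaviour.
expected = df.groupby("A").apply(lambda x: x.loc[:, "B"])

# Alternative: pick the column first, then apply on the resulting Series groups.
alternative = df.groupby("A")["B"].apply(lambda s: s)

print(expected.tolist())
print(alternative.tolist())
```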