df.groupby('index_column_name') results in a key error. In pandas it doesn't

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): I use arch btw
  • Modin version (modin.__version__): 0.8.0
  • Python version: 3.8
  • Code we can use to reproduce:

Pandas (works fine):

import pandas as pd
from pandas import util
df = util.testing.makeMixedDataFrame()
print(df.head())
df = df.to_numpy()
df = pd.DataFrame(df)
df.columns = ["A", "B", "C", "D"]
df = df.set_index("C")
df.groupby("C")

Modin (gives error):

import modin.pandas as pd
from pandas import util
df = util.testing.makeMixedDataFrame()
print(df.head())
df = df.to_numpy()
df = pd.DataFrame(df)
df.columns = ["A", "B", "C", "D"]
df = df.set_index("C")
df.groupby("C")  # <- KeyError (see the traceback below)

Describe the problem

In pandas, calling groupby with the name of the index column works and does not raise a KeyError. In Modin, the same call raises a KeyError.
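For reference, here is a minimal pandas-only sketch of the behavior described above. The small hand-built frame stands in for makeMixedDataFrame() and is an assumption, not the reporter's data. It shows that pandas accepts the index name as a groupby key, and that groupby(level="C") and reset_index() followed by groupby("C") give the same result in pandas. Whether either alternative avoids the error on an affected Modin version is not confirmed in this issue.

```python
import pandas as pd

# Hand-built stand-in for the reporter's frame (assumption, not the original data).
df = pd.DataFrame(
    {
        "A": [0.0, 1.0, 2.0, 3.0, 4.0],
        "B": [0.0, 1.0, 0.0, 1.0, 0.0],
        "C": ["foo1", "foo2", "foo3", "foo4", "foo5"],
    }
).set_index("C")

# pandas resolves "C" as an index level name because no column is called "C".
by_name = df.groupby("C").sum()

# Equivalent spellings in pandas: group by the index level explicitly,
# or move the index back into a column first.
by_level = df.groupby(level="C").sum()
via_column = df.reset_index().groupby("C").sum()

print(by_name)
print(by_name.equals(by_level), by_name.equals(via_column))
```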

Source code / logs

KeyError                                  Traceback (most recent call last)
 in 
      7 df.columns = ["A", "B", "C", "D"]
      8 df = df.set_index("C")
----> 9 df.groupby("C")

~/anaconda3/envs/recnn/lib/python3.8/site-packages/modin/pandas/dataframe.py in groupby(self, by, axis, level, as_index, sort, group_keys, squeeze, observed)
    434                 pass
    435             else:
--> 436                 by = self.__getitem__(by)._query_compiler
    437         elif isinstance(by, Series):
    438             drop = by._parent is self

~/anaconda3/envs/recnn/lib/python3.8/site-packages/modin/pandas/base.py in __getitem__(self, key)
   3458             return self._getitem_slice(indexer)
   3459         else:
-> 3460             return self._getitem(key)
   3461 
   3462     def _getitem_slice(self, key):

~/anaconda3/envs/recnn/lib/python3.8/site-packages/modin/pandas/dataframe.py in _getitem(self, key)
   2422             # return self._getitem_multilevel(key)
   2423         else:
-> 2424             return self._getitem_column(key)
   2425 
   2426     def _getitem_column(self, key):

~/anaconda3/envs/recnn/lib/python3.8/site-packages/modin/pandas/dataframe.py in _getitem_column(self, key)
   2426     def _getitem_column(self, key):
   2427         if key not in self.keys():
-> 2428             raise KeyError("{}".format(key))
   2429         s = DataFrame(
   2430             query_compiler=self._query_compiler.getitem_column_array([key])

KeyError: 'C'
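
The traceback shows the string key being passed to _getitem_column, which raises as soon as the key is not among the column labels (self.keys()). The following is a rough pure-Python sketch of a lookup that also considers index level names, which is roughly how pandas treats a string passed to groupby. The function resolve_group_key and its arguments are hypothetical names for illustration only; this is not Modin's code and not the fix that was applied.

```python
def resolve_group_key(columns, index_names, key):
    """Say whether a groupby key refers to a column or an index level.

    Hypothetical helper for illustration; not part of pandas or Modin.
    """
    in_columns = key in columns
    in_index = key in index_names
    if in_columns and in_index:
        # pandas treats a label that is both a column and an index level
        # as ambiguous and raises a ValueError.
        raise ValueError(f"{key!r} is both an index level and a column label")
    if in_columns:
        return "column"
    if in_index:
        return "index"
    # Only at this point is the key truly missing.
    raise KeyError(key)


# After set_index("C"), "C" is no longer a column but is an index level name.
print(resolve_group_key(["A", "B", "D"], ["C"], "C"))  # index
print(resolve_group_key(["A", "B", "D"], ["C"], "A"))  # column
```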

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 10 (6 by maintainers)

Top GitHub Comments

YarShev commented, Sep 23, 2020 (1 reaction)

I reopened the issue because the fix doesn’t fully resolve the problem.

import modin.pandas as pd
import pandas
from pandas import util
df = util.testing.makeMixedDataFrame()
df = df.to_numpy()
pdf = pandas.DataFrame(df)
mdf = pd.DataFrame(df)
pdf.columns = ["A", "B", "C", "D"]
mdf.columns = ["A", "B", "C", "D"]
pdf = pdf.set_index("C")
mdf = mdf.set_index("C")
# data frames are equal so far
pdf
      A  B          D
C
foo1  0  0 2009-01-01
foo2  1  1 2009-01-02
foo3  2  0 2009-01-05
foo4  3  1 2009-01-06
foo5  4  0 2009-01-07

mdf
      A  B          D
C
foo1  0  0 2009-01-01
foo2  1  1 2009-01-02
foo3  2  0 2009-01-05
foo4  3  1 2009-01-06
foo5  4  0 2009-01-07

# However, number of groups is not equal
pdf.groupby("C").groups
{'foo1': ['foo1'], 'foo2': ['foo2'], 'foo3': ['foo3'], 'foo4': ['foo4'], 'foo5': ['foo5']}

mdf.groupby("C").groups
{'C': ['foo1']}

# `sum` operation is performed incorrectly
pdf.groupby("C").sum()
        A    B
C
foo1  0.0  0.0
foo2  1.0  1.0
foo3  2.0  0.0
foo4  3.0  1.0
foo5  4.0  0.0

mdf.groupby("C").sum()
Empty DataFrame
Columns: []
Index: []

yangyxt commented, Dec 6, 2022 (0 reactions)

I ran into this issue today, and I feel it has not been completely resolved.

Using modin 0.15.1.

The data frame is basically:

df = mpd.DataFrame({"A": [1, 2, 3], "B": [1, 4, 5]})

Running this:

df.groupby("A").apply(lambda x: x.loc[:, "B"])

gives a KeyError on "B".
