`GroupBy.__getitem__` does not include the 'by' columns in the resulting object, which leads to a `KeyError`

System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Any
Modin version (modin.__version__):
Python version: 3.7.5
Code we can use to reproduce:

import modin.pandas as pd

df = pd.DataFrame(
    {
        "a": [1, 1, 2, 3],
        "b": [3, 4, 5, 6],
        "c": [7, 8, 9, 10],
        "d": [11, 12, 13, 14],
    }
)

grp = df.groupby("a")
grp_view_columns = grp[["b", "c"]]
grp_view_columns.agg(["mean"]) # KeyError: 'a' (failed to default to pandas)

grp = df.groupby(by=["a", [1, 1, 2, 2]])
grp_view_columns = grp[["b", "c"]]
repr(grp_view_columns.mean()) # KeyError: 'a' (Modin aggregation fails)

Tracebacks

First exception:

UserWarning: `DataFrame.groupby_on_multiple_columns` defaulting to pandas implementation.
To request implementation, send an email to feature_requests@modin.org.
FutureWarning: The `squeeze` parameter is deprecated and will be removed in a future version.
Traceback (most recent call last):
  File "t3.py", line 14, in <module>
    grp_view_columns.agg(["mean"]) # KeyError: 'a' (failed to default to pandas)
  File "modin/pandas/groupby.py", line 438, in aggregate
    **kwargs,
  File "modin/pandas/groupby.py", line 985, in _default_to_pandas
    return self._df._default_to_pandas(groupby_on_multiple_columns, *args, **kwargs)
  File "modin/pandas/base.py", line 459, in _default_to_pandas
    result = op(pandas_obj, *args, **kwargs)
  File "modin/pandas/groupby.py", line 979, in groupby_on_multiple_columns
    by=by, axis=self._axis, squeeze=self._squeeze, **self._kwargs
  File "miniconda3/envs/modin_tests/lib/python3.7/site-packages/pandas/core/frame.py", line 7636, in groupby
    dropna=dropna,
  File "miniconda3/envs/modin_tests/lib/python3.7/site-packages/pandas/core/groupby/groupby.py", line 896, in __init__
    dropna=self.dropna,
  File "miniconda3/envs/modin_tests/lib/python3.7/site-packages/pandas/core/groupby/grouper.py", line 860, in get_grouper
    raise KeyError(gpr)
KeyError: 'a'

Second exception:

ray::deploy_ray_func() (pid=1790987, ip=10.125.130.30)
  File "python/ray/_raylet.pyx", line 490, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 497, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 501, in ray._raylet.execute_task
  File "/localdisk/dchigare/miniconda3/envs/modin_tests/lib/python3.7/site-packages/ray/util/tracing/tracing_helper.py", line 330, in _function_with_tracing
    return function(*args, **kwargs)
  File "/localdisk/dchigare/repos/modin_bp/modin/engines/ray/pandas_on_ray/frame/axis_partition.py", line 207, in deploy_ray_func
    result = func(*args)
  File "/localdisk/dchigare/repos/modin_bp/modin/engines/base/frame/axis_partition.py", line 303, in deploy_axis_func
    result = func(dataframe, **kwargs)
  File "/localdisk/dchigare/repos/modin_bp/modin/engines/base/frame/data.py", line 1153, in _map_reduce_func
    series_result = func(df, *args, **kwargs)
  File "/localdisk/dchigare/repos/modin_bp/modin/backends/pandas/query_compiler.py", line 2845, in <lambda>
    df, by, drop, partition_idx
  File "/localdisk/dchigare/repos/modin_bp/modin/backends/pandas/query_compiler.py", line 2838, in groupby_agg_builder
    return compute_groupby(df.copy(), drop, partition_idx)
  File "/localdisk/dchigare/repos/modin_bp/modin/backends/pandas/query_compiler.py", line 2742, in compute_groupby
    grouped_df = df.groupby(by=by, axis=axis, **groupby_kwargs)
  File "/localdisk/dchigare/miniconda3/envs/modin_tests/lib/python3.7/site-packages/pandas/core/frame.py", line 7636, in groupby
    dropna=dropna,
  File "/localdisk/dchigare/miniconda3/envs/modin_tests/lib/python3.7/site-packages/pandas/core/groupby/groupby.py", line 896, in __init__
    dropna=self.dropna,
  File "/localdisk/dchigare/miniconda3/envs/modin_tests/lib/python3.7/site-packages/pandas/core/groupby/grouper.py", line 860, in get_grouper
    raise KeyError(gpr)
KeyError: 'a'

Describe the problem

These KeyErrors occur because the slice of the groupby object (produced by __getitem__) holds a source _df that contains only the columns specified by the key argument, without the “by” columns: https://github.com/modin-project/modin/blob/e561a516217abbf630e89f794c838fb878c30a5f/modin/pandas/groupby.py#L353-L360
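
The failure mode described above can be imitated with plain pandas. This is a minimal sketch, not Modin's code: the helper `groupby_after_selection` is a hypothetical name that simply narrows the frame to the selected columns before grouping, which is the situation the description attributes to `__getitem__`.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "a": [1, 1, 2, 3],
        "b": [3, 4, 5, 6],
        "c": [7, 8, 9, 10],
        "d": [11, 12, 13, 14],
    }
)


def groupby_after_selection(frame, by, key):
    # Hypothetical helper: drop every column except `key` first,
    # then try to group by a label that may no longer be present.
    narrowed = frame[key]
    return narrowed.groupby(by)


try:
    groupby_after_selection(df, "a", ["b", "c"])
    raised = False
    missing = None
except KeyError as err:
    # pandas cannot resolve "a" because the narrowed frame lacks that column.
    raised = True
    missing = err.args[0]

print(raised, missing)
```

Once the "by" column is gone from the frame, pandas has nothing to resolve the label against, which matches the `KeyError: 'a'` at the bottom of both tracebacks.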

Issue Analytics

State:
Created 2 years ago
Comments: 5 (4 by maintainers)

Top GitHub Comments

dchigarev commented, Sep 3, 2021

@straiffix, you’re right, your problem is directly caused by this issue. The bug is being fixed in PR #3298, and the fix will most likely be part of the next release.

I’ve tried your reproducer on the #3298 branch and it works fine (except that groupby.rolling still defaults to pandas).

As a workaround for now, you can select the columns, including the “by” column, before grouping:

corr = modin_data[["Column_1", "Column_2", "Column_3"]].groupby("Column_1").rolling(26, min_periods=4).corr()
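
The workaround selects the columns first and keeps the grouping column in that selection, so the key is still available when the groupby runs. As a rough check of the pattern, the sketch below uses plain pandas and a simple `mean()` rather than `rolling(...).corr()`; the frame and column names are illustrative assumptions, not the reporter's data.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "a": [1, 1, 2, 3],
        "b": [3, 4, 5, 6],
        "c": [7, 8, 9, 10],
        "d": [11, 12, 13, 14],
    }
)

# Select after grouping: the form that hit the KeyError in Modin.
selected_after = df.groupby("a")[["b", "c"]].mean()

# Select before grouping, keeping the key column "a" in the frame:
# the same shape as the suggested workaround.
selected_before = df[["a", "b", "c"]].groupby("a").mean()

# In pandas both spellings give the same aggregated frame.
print(selected_before.equals(selected_after))
```

For a plain aggregation the two spellings are interchangeable in pandas, so the pre-selection form is a reasonable stand-in until the fix is released.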

straiffix commented, Sep 3, 2021

I got an error that could possibly be related to this issue.

OS Platform and Distribution: Linux Ubuntu 16.04.7
Modin version (modin.__version__): 0.10.2
Python version: 3.8.9

cor = modin_data.groupby(['Column_1'])[['Column_2', 'Column_3']].rolling(26, min_periods=4).corr()

The same code works with pandas, but not with Modin.

Traceback

KeyError                                  Traceback (most recent call last)
~/model/process.py in <module>
----> 1 cor = modin_data.groupby(['Column_1'])[['Column_2', 'Column_3']].rolling(26, min_periods=4).corr()

~/CLV/.env/clv_env/lib/python3.8/site-packages/modin/pandas/groupby.py in rolling(self, *args, **kwargs)
    673 
    674     def rolling(self, *args, **kwargs):
--> 675         return self._default_to_pandas(lambda df: df.rolling(*args, **kwargs))
    676 
    677     def hist(self):

~/CLV/.env/clv_env/lib/python3.8/site-packages/modin/pandas/groupby.py in _default_to_pandas(self, f, *args, **kwargs)
    971             )
    972 
--> 973         return self._df._default_to_pandas(groupby_on_multiple_columns, *args, **kwargs)
    974 
    975 

~/CLV/.env/clv_env/lib/python3.8/site-packages/modin/pandas/base.py in _default_to_pandas(self, op, *args, **kwargs)
    457         pandas_obj = self._to_pandas()
    458         if callable(op):
--> 459             result = op(pandas_obj, *args, **kwargs)
    460         elif isinstance(op, str):
    461             # The inner `getattr` is ensuring that we are treating this object (whether

~/CLV/.env/clv_env/lib/python3.8/site-packages/modin/pandas/groupby.py in groupby_on_multiple_columns(df, *args, **kwargs)
    964         def groupby_on_multiple_columns(df, *args, **kwargs):
    965             return f(
--> 966                 df.groupby(
    967                     by=by, axis=self._axis, squeeze=self._squeeze, **self._kwargs
    968                 ),

~/CLV/.env/clv_env/lib/python3.8/site-packages/pandas/core/frame.py in groupby(self, by, axis, level, as_index, sort, group_keys, squeeze, observed, dropna)
   7624         # error: Argument "squeeze" to "DataFrameGroupBy" has incompatible type
   7625         # "Union[bool, NoDefault]"; expected "bool"
-> 7626         return DataFrameGroupBy(
   7627             obj=self,
   7628             keys=by,

~/CLV/.env/clv_env/lib/python3.8/site-packages/pandas/core/groupby/groupby.py in __init__(self, obj, keys, axis, level, grouper, exclusions, selection, as_index, sort, group_keys, squeeze, observed, mutated, dropna)
    886             from pandas.core.groupby.grouper import get_grouper
    887 
--> 888             grouper, exclusions, obj = get_grouper(
    889                 obj,
    890                 keys,

~/CLV/.env/clv_env/lib/python3.8/site-packages/pandas/core/groupby/grouper.py in get_grouper(obj, key, axis, level, sort, observed, mutated, validate, dropna)
    858                 in_axis, level, gpr = False, gpr, None
    859             else:
--> 860                 raise KeyError(gpr)
    861         elif isinstance(gpr, Grouper) and gpr.key is not None:
    862             # Add key to exclusions

KeyError: 'Column_1'
