`GroupBy.__getitem__` does not include 'by' columns into resulted object which leads to `KeyError`


System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Any
  • Modin version (modin.__version__):
  • Python version: 3.7.5
  • Code we can use to reproduce:
import modin.pandas as pd

df = pd.DataFrame(
    {
        "a": [1, 1, 2, 3],
        "b": [3, 4, 5, 6],
        "c": [7, 8, 9, 10],
        "d": [11, 12, 13, 14],
    }
)

grp = df.groupby("a")
grp_view_columns = grp[["b", "c"]]
grp_view_columns.agg(["mean"]) # KeyError: 'a' (failed to default to pandas)

grp = df.groupby(by=["a", [1, 1, 2, 2]])
grp_view_columns = grp[["b", "c"]]
repr(grp_view_columns.mean())  # KeyError: 'a' (Modin's own aggregation fails)

Tracebacks

First exception:

UserWarning: `DataFrame.groupby_on_multiple_columns` defaulting to pandas implementation.
To request implementation, send an email to feature_requests@modin.org.
FutureWarning: The `squeeze` parameter is deprecated and will be removed in a future version.
Traceback (most recent call last):
  File "t3.py", line 14, in <module>
    grp_view_columns.agg(["mean"]) # KeyError: 'a' (failed to default to pandas)
  File "modin/pandas/groupby.py", line 438, in aggregate
    **kwargs,
  File "modin/pandas/groupby.py", line 985, in _default_to_pandas
    return self._df._default_to_pandas(groupby_on_multiple_columns, *args, **kwargs)
  File "modin/pandas/base.py", line 459, in _default_to_pandas
    result = op(pandas_obj, *args, **kwargs)
  File "modin/pandas/groupby.py", line 979, in groupby_on_multiple_columns
    by=by, axis=self._axis, squeeze=self._squeeze, **self._kwargs
  File "miniconda3/envs/modin_tests/lib/python3.7/site-packages/pandas/core/frame.py", line 7636, in groupby
    dropna=dropna,
  File "miniconda3/envs/modin_tests/lib/python3.7/site-packages/pandas/core/groupby/groupby.py", line 896, in __init__
    dropna=self.dropna,
  File "miniconda3/envs/modin_tests/lib/python3.7/site-packages/pandas/core/groupby/grouper.py", line 860, in get_grouper
    raise KeyError(gpr)
KeyError: 'a'

Second exception:

ray::deploy_ray_func() (pid=1790987, ip=10.125.130.30)
  File "python/ray/_raylet.pyx", line 490, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 497, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 501, in ray._raylet.execute_task
  File "/localdisk/dchigare/miniconda3/envs/modin_tests/lib/python3.7/site-packages/ray/util/tracing/tracing_helper.py", line 330, in _function_with_tracing
    return function(*args, **kwargs)
  File "/localdisk/dchigare/repos/modin_bp/modin/engines/ray/pandas_on_ray/frame/axis_partition.py", line 207, in deploy_ray_func
    result = func(*args)
  File "/localdisk/dchigare/repos/modin_bp/modin/engines/base/frame/axis_partition.py", line 303, in deploy_axis_func
    result = func(dataframe, **kwargs)
  File "/localdisk/dchigare/repos/modin_bp/modin/engines/base/frame/data.py", line 1153, in _map_reduce_func
    series_result = func(df, *args, **kwargs)
  File "/localdisk/dchigare/repos/modin_bp/modin/backends/pandas/query_compiler.py", line 2845, in <lambda>
    df, by, drop, partition_idx
  File "/localdisk/dchigare/repos/modin_bp/modin/backends/pandas/query_compiler.py", line 2838, in groupby_agg_builder
    return compute_groupby(df.copy(), drop, partition_idx)
  File "/localdisk/dchigare/repos/modin_bp/modin/backends/pandas/query_compiler.py", line 2742, in compute_groupby
    grouped_df = df.groupby(by=by, axis=axis, **groupby_kwargs)
  File "/localdisk/dchigare/miniconda3/envs/modin_tests/lib/python3.7/site-packages/pandas/core/frame.py", line 7636, in groupby
    dropna=dropna,
  File "/localdisk/dchigare/miniconda3/envs/modin_tests/lib/python3.7/site-packages/pandas/core/groupby/groupby.py", line 896, in __init__
    dropna=self.dropna,
  File "/localdisk/dchigare/miniconda3/envs/modin_tests/lib/python3.7/site-packages/pandas/core/groupby/grouper.py", line 860, in get_grouper
    raise KeyError(gpr)
KeyError: 'a'

Describe the problem

Both KeyErrors have the same cause: the sliced groupby object returned by `__getitem__` stores a source `_df` that contains only the columns named by the key argument, without the “by” columns. Any later operation that groups `_df` by "a" therefore cannot find that column: https://github.com/modin-project/modin/blob/e561a516217abbf630e89f794c838fb878c30a5f/modin/pandas/groupby.py#L353-L360
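
The failure can be reproduced in plain pandas, without Modin, by grouping a frame that has already lost its key column. The snippet below is only a sketch of the mechanism, assuming that the sliced object's `_df` is equivalent to `df[["b", "c"]]`; it is not Modin's actual code path.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "a": [1, 1, 2, 3],
        "b": [3, 4, 5, 6],
        "c": [7, 8, 9, 10],
        "d": [11, 12, 13, 14],
    }
)

# Assumption: this mimics the `_df` held by the sliced groupby object,
# i.e. only the selected columns, with the "by" column "a" dropped.
sliced = df[["b", "c"]]

try:
    # pandas resolves the grouping key eagerly, so this raises immediately.
    sliced.groupby("a")
    error = None
except KeyError as exc:
    error = exc

print(repr(error))  # KeyError('a')
```

The exception is the same `KeyError: 'a'` that pandas raises from `get_grouper` in both tracebacks above.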

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

1 reaction
dchigarev commented, Sep 3, 2021

@straiffix, you’re right, your problem is caused directly by this issue. The bug is being fixed in PR #3298, and the fix will likely be part of the next release.

I’ve tried your reproducer on the #3298 branch and it works fine (except that groupby.rolling still defaults to pandas).

As a workaround for now, you can use:

corr = modin_data[["Column_1", "Column_2", "Column_3"]].groupby("Column_1").rolling(26, min_periods=4).corr()
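
The workaround follows a general pattern: select the wanted columns together with the key columns first, then group. A minimal sketch of that pattern is shown here with plain pandas and made-up data; `groupby_selected` is a hypothetical helper, not part of Modin or pandas, and it only handles keys that are column labels (not array-like keys such as `[1, 1, 2, 2]`).

```python
import pandas as pd


def groupby_selected(df, by, columns):
    """Hypothetical helper: keep `columns` plus the key columns, then group."""
    keys = [by] if isinstance(by, str) else list(by)
    # Put the keys first and avoid duplicating a key that is also selected.
    selected = keys + [col for col in columns if col not in keys]
    return df[selected].groupby(by)


# Made-up data for illustration only.
data = pd.DataFrame(
    {
        "Column_1": ["x", "x", "y", "y"],
        "Column_2": [1.0, 2.0, 3.0, 4.0],
        "Column_3": [4.0, 3.0, 2.0, 1.0],
        "Column_4": [0, 0, 0, 0],
    }
)

means = groupby_selected(data, "Column_1", ["Column_2", "Column_3"]).mean()
print(means)
```

In pandas this gives the same result as `data.groupby("Column_1")[["Column_2", "Column_3"]].mean()`, which is the form that fails in the affected Modin version.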
0 reactions
straiffix commented, Sep 3, 2021

I got an error that may be related to this issue.

OS Platform and Distribution: Linux Ubuntu 16.04.7
Modin version (modin.__version__): 0.10.2
Python version: 3.8.9
cor = modin_data.groupby(['Column_1'])[['Column_2', 'Column_3']].rolling(26, min_periods=4).corr()

The same code works with pandas but not with Modin.

Traceback

KeyError                                  Traceback (most recent call last)
~/model/process.py in <module>
----> 1 cor = modin_data.groupby(['Column_1'])[['Column_2', 'Column_3']].rolling(26, min_periods=4).corr()

~/CLV/.env/clv_env/lib/python3.8/site-packages/modin/pandas/groupby.py in rolling(self, *args, **kwargs)
    673 
    674     def rolling(self, *args, **kwargs):
--> 675         return self._default_to_pandas(lambda df: df.rolling(*args, **kwargs))
    676 
    677     def hist(self):

~/CLV/.env/clv_env/lib/python3.8/site-packages/modin/pandas/groupby.py in _default_to_pandas(self, f, *args, **kwargs)
    971             )
    972 
--> 973         return self._df._default_to_pandas(groupby_on_multiple_columns, *args, **kwargs)
    974 
    975 

~/CLV/.env/clv_env/lib/python3.8/site-packages/modin/pandas/base.py in _default_to_pandas(self, op, *args, **kwargs)
    457         pandas_obj = self._to_pandas()
    458         if callable(op):
--> 459             result = op(pandas_obj, *args, **kwargs)
    460         elif isinstance(op, str):
    461             # The inner `getattr` is ensuring that we are treating this object (whether

~/CLV/.env/clv_env/lib/python3.8/site-packages/modin/pandas/groupby.py in groupby_on_multiple_columns(df, *args, **kwargs)
    964         def groupby_on_multiple_columns(df, *args, **kwargs):
    965             return f(
--> 966                 df.groupby(
    967                     by=by, axis=self._axis, squeeze=self._squeeze, **self._kwargs
    968                 ),

~/CLV/.env/clv_env/lib/python3.8/site-packages/pandas/core/frame.py in groupby(self, by, axis, level, as_index, sort, group_keys, squeeze, observed, dropna)
   7624         # error: Argument "squeeze" to "DataFrameGroupBy" has incompatible type
   7625         # "Union[bool, NoDefault]"; expected "bool"
-> 7626         return DataFrameGroupBy(
   7627             obj=self,
   7628             keys=by,

~/CLV/.env/clv_env/lib/python3.8/site-packages/pandas/core/groupby/groupby.py in __init__(self, obj, keys, axis, level, grouper, exclusions, selection, as_index, sort, group_keys, squeeze, observed, mutated, dropna)
    886             from pandas.core.groupby.grouper import get_grouper
    887 
--> 888             grouper, exclusions, obj = get_grouper(
    889                 obj,
    890                 keys,

~/CLV/.env/clv_env/lib/python3.8/site-packages/pandas/core/groupby/grouper.py in get_grouper(obj, key, axis, level, sort, observed, mutated, validate, dropna)
    858                 in_axis, level, gpr = False, gpr, None
    859             else:
--> 860                 raise KeyError(gpr)
    861         elif isinstance(gpr, Grouper) and gpr.key is not None:
    862             # Add key to exclusions

KeyError: 'Column_1'


