`GroupBy.__getitem__` does not include 'by' columns in the resulting object, which leads to `KeyError`
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Any
- Modin version (`modin.__version__`):
- Python version: 3.7.5
- Code we can use to reproduce:
import modin.pandas as pd

df = pd.DataFrame(
    {
        "a": [1, 1, 2, 3],
        "b": [3, 4, 5, 6],
        "c": [7, 8, 9, 10],
        "d": [11, 12, 13, 14],
    }
)

# Case 1: group by a column name, then select other columns
grp = df.groupby("a")
grp_view_columns = grp[["b", "c"]]
grp_view_columns.agg(["mean"])  # KeyError: 'a' (fails when defaulting to pandas)

# Case 2: group by a column name plus an external list of keys
grp = df.groupby(by=["a", [1, 1, 2, 2]])
grp_view_columns = grp[["b", "c"]]
repr(grp_view_columns.mean())  # KeyError: 'a' (Modin aggregation fails)
Tracebacks
First exception:
UserWarning: `DataFrame.groupby_on_multiple_columns` defaulting to pandas implementation.
To request implementation, send an email to feature_requests@modin.org.
FutureWarning: The `squeeze` parameter is deprecated and will be removed in a future version.
Traceback (most recent call last):
File "t3.py", line 14, in <module>
grp_view_columns.agg(["mean"]) # KeyError: 'a' (failed to default to pandas)
File "modin/pandas/groupby.py", line 438, in aggregate
**kwargs,
File "modin/pandas/groupby.py", line 985, in _default_to_pandas
return self._df._default_to_pandas(groupby_on_multiple_columns, *args, **kwargs)
File "modin/pandas/base.py", line 459, in _default_to_pandas
result = op(pandas_obj, *args, **kwargs)
File "modin/pandas/groupby.py", line 979, in groupby_on_multiple_columns
by=by, axis=self._axis, squeeze=self._squeeze, **self._kwargs
File "miniconda3/envs/modin_tests/lib/python3.7/site-packages/pandas/core/frame.py", line 7636, in groupby
dropna=dropna,
File "miniconda3/envs/modin_tests/lib/python3.7/site-packages/pandas/core/groupby/groupby.py", line 896, in __init__
dropna=self.dropna,
File "miniconda3/envs/modin_tests/lib/python3.7/site-packages/pandas/core/groupby/grouper.py", line 860, in get_grouper
raise KeyError(gpr)
KeyError: 'a'
Second exception:
ray::deploy_ray_func() (pid=1790987, ip=10.125.130.30)
File "python/ray/_raylet.pyx", line 490, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 497, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 501, in ray._raylet.execute_task
File "/localdisk/dchigare/miniconda3/envs/modin_tests/lib/python3.7/site-packages/ray/util/tracing/tracing_helper.py", line 330, in _function_with_tracing
return function(*args, **kwargs)
File "/localdisk/dchigare/repos/modin_bp/modin/engines/ray/pandas_on_ray/frame/axis_partition.py", line 207, in deploy_ray_func
result = func(*args)
File "/localdisk/dchigare/repos/modin_bp/modin/engines/base/frame/axis_partition.py", line 303, in deploy_axis_func
result = func(dataframe, **kwargs)
File "/localdisk/dchigare/repos/modin_bp/modin/engines/base/frame/data.py", line 1153, in _map_reduce_func
series_result = func(df, *args, **kwargs)
File "/localdisk/dchigare/repos/modin_bp/modin/backends/pandas/query_compiler.py", line 2845, in <lambda>
df, by, drop, partition_idx
File "/localdisk/dchigare/repos/modin_bp/modin/backends/pandas/query_compiler.py", line 2838, in groupby_agg_builder
return compute_groupby(df.copy(), drop, partition_idx)
File "/localdisk/dchigare/repos/modin_bp/modin/backends/pandas/query_compiler.py", line 2742, in compute_groupby
grouped_df = df.groupby(by=by, axis=axis, **groupby_kwargs)
File "/localdisk/dchigare/miniconda3/envs/modin_tests/lib/python3.7/site-packages/pandas/core/frame.py", line 7636, in groupby
dropna=dropna,
File "/localdisk/dchigare/miniconda3/envs/modin_tests/lib/python3.7/site-packages/pandas/core/groupby/groupby.py", line 896, in __init__
dropna=self.dropna,
File "/localdisk/dchigare/miniconda3/envs/modin_tests/lib/python3.7/site-packages/pandas/core/groupby/grouper.py", line 860, in get_grouper
raise KeyError(gpr)
KeyError: 'a'
Describe the problem
These `KeyError`s occur because the slice of the groupby object (produced by `__getitem__`) holds a source `_df` that lacks the "by" columns and contains only the columns specified by the `key` argument:
https://github.com/modin-project/modin/blob/e561a516217abbf630e89f794c838fb878c30a5f/modin/pandas/groupby.py#L353-L360
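The mechanism can be illustrated with plain pandas. This is only a sketch of the failure mode, not Modin's actual code: `group_subset` and its `keep_by` flag are hypothetical names introduced here. Dropping the "by" column before grouping raises the same `KeyError: 'a'`, while keeping it alongside the selected columns gives the expected result.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "a": [1, 1, 2, 3],
        "b": [3, 4, 5, 6],
        "c": [7, 8, 9, 10],
        "d": [11, 12, 13, 14],
    }
)


def group_subset(frame, by, key, keep_by):
    """Hypothetical helper: narrow the frame to `key` columns, then group by `by`."""
    # With keep_by=False the "by" column is dropped from the source frame,
    # which mimics the state of the sliced groupby object described above.
    cols = ([by] if keep_by else []) + list(key)
    return frame[cols].groupby(by)[list(key)].mean()


# Without the "by" column the groupby cannot resolve "a".
try:
    group_subset(df, "a", ["b", "c"], keep_by=False)
    failed = False
except KeyError:
    failed = True

# With the "by" column retained, the result matches the direct pandas call.
result = group_subset(df, "a", ["b", "c"], keep_by=True)
expected = df.groupby("a")[["b", "c"]].mean()
```

Under this reading, a fix would need the sliced object to keep the "by" columns available for grouping while still limiting the aggregated output to the `key` columns.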
Issue Analytics
- Created 2 years ago
- Comments: 5 (4 by maintainers)
@straiffix, you're right, your problem is directly caused by this issue. The bug is being fixed in PR #3298, and the fix will most likely be part of the next release.
I've tried your reproducer on the #3298 branch and it works fine (except for defaulting to pandas on `groupby.rolling`).
As a workaround for now you may use:
I got an error that could possibly be related to this issue.
The same code works with pandas, but not with Modin.
Traceback