BUG: Can't pass modin functions to modin functions that take a callable parameter
System information
- OS X 11.6.4
- Modin version ‘0.15.2’
- Python 3.9.12
Describe the problem
When I try to use DataFrame.apply with DataFrame.sample, Modin throws an error. The same code runs without issues with pandas.
Source code / logs
>>> import modin.pandas as pd
>>> import modin.config as cfg
>>> cfg.Engine.put('Python')
>>> x = pd.DataFrame(data={"a":[1,2,3], "b":[1,2,3], "c":[1,2,3]})
UserWarning: Distributing <class 'dict'> object. This may take some time.
>>> x = x.set_index(['a'])
>>> x.groupby('a', group_keys=False).apply(pd.DataFrame.sample, n=1)
Traceback (most recent call last):
File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/pandas/core/generic.py", line 550, in _get_axis_number
return cls._AXIS_TO_AXIS_NUMBER[axis]
KeyError: None
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/modin/core/storage_formats/pandas/query_compiler.py", line 2805, in groupby_agg_builder
return compute_groupby(df, drop, partition_idx)
File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/modin/core/storage_formats/pandas/query_compiler.py", line 2754, in compute_groupby
result = partition_agg_func(grouped_df, *agg_args, **agg_kwargs)
File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/pandas/core/groupby/groupby.py", line 1423, in apply
result = self._python_apply_general(f, self._selected_obj)
File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/pandas/core/groupby/groupby.py", line 1464, in _python_apply_general
values, mutated = self.grouper.apply(f, data, self.axis)
File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/pandas/core/groupby/ops.py", line 761, in apply
res = f(group)
File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/pandas/core/groupby/groupby.py", line 1397, in f
return func(g, *args, **kwargs)
File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/modin/utils.py", line 521, in wrapper
result = func(*args, **kwargs)
File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/modin/logging/logger_metaclass.py", line 68, in log_wrap
return method(*args, **kwargs)
File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/modin/pandas/base.py", line 2567, in sample
axis = self._get_axis_number(axis)
File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/pandas/core/generic.py", line 552, in _get_axis_number
raise ValueError(f"No axis named {axis} for object type {cls.__name__}")
ValueError: No axis named None for object type DataFrame
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/pandas/core/generic.py", line 550, in _get_axis_number
return cls._AXIS_TO_AXIS_NUMBER[axis]
KeyError: None
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/modin/logging/logger_metaclass.py", line 68, in log_wrap
return method(*args, **kwargs)
File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/modin/pandas/groupby.py", line 327, in apply
self._wrap_aggregation(
File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/modin/logging/logger_metaclass.py", line 68, in log_wrap
return method(*args, **kwargs)
File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/modin/pandas/groupby.py", line 1082, in _wrap_aggregation
query_compiler=qc_method(
File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/modin/logging/logger_metaclass.py", line 68, in log_wrap
return method(*args, **kwargs)
File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/modin/core/storage_formats/pandas/query_compiler.py", line 2813, in groupby_agg
new_modin_frame = self._modin_frame.broadcast_apply_full_axis(
File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/modin/logging/logger_metaclass.py", line 68, in log_wrap
return method(*args, **kwargs)
File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/modin/core/dataframe/pandas/dataframe/dataframe.py", line 115, in run_f_on_minimally_updated_metadata
result = f(self, *args, **kwargs)
File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/modin/core/dataframe/pandas/dataframe/dataframe.py", line 2330, in broadcast_apply_full_axis
new_partitions = self._partition_mgr_cls.broadcast_axis_partitions(
File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/modin/core/dataframe/pandas/partitioning/partition_manager.py", line 463, in broadcast_axis_partitions
[
File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/modin/core/dataframe/pandas/partitioning/partition_manager.py", line 464, in <listcomp>
left_partitions[i].apply(
File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/modin/core/dataframe/pandas/partitioning/axis_partition.py", line 99, in apply
return self._wrap_partitions(self.deploy_axis_func(*args, **kwargs))
File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/modin/core/dataframe/pandas/partitioning/axis_partition.py", line 162, in deploy_axis_func
result = func(dataframe, *kwargs.pop("args", ()), **kwargs)
File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/modin/core/dataframe/pandas/dataframe/dataframe.py", line 1393, in _tree_reduce_func
series_result = func(df, *args, **kwargs)
File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/modin/core/storage_formats/pandas/query_compiler.py", line 2815, in <lambda>
func=lambda df, by=None, partition_idx=None: groupby_agg_builder(
File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/modin/core/storage_formats/pandas/query_compiler.py", line 2809, in groupby_agg_builder
return compute_groupby(df.copy(), drop, partition_idx)
File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/modin/core/storage_formats/pandas/query_compiler.py", line 2754, in compute_groupby
result = partition_agg_func(grouped_df, *agg_args, **agg_kwargs)
File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/pandas/core/groupby/groupby.py", line 1423, in apply
result = self._python_apply_general(f, self._selected_obj)
File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/pandas/core/groupby/groupby.py", line 1464, in _python_apply_general
values, mutated = self.grouper.apply(f, data, self.axis)
File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/pandas/core/groupby/ops.py", line 761, in apply
res = f(group)
File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/pandas/core/groupby/groupby.py", line 1397, in f
return func(g, *args, **kwargs)
File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/modin/utils.py", line 521, in wrapper
result = func(*args, **kwargs)
File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/modin/logging/logger_metaclass.py", line 68, in log_wrap
return method(*args, **kwargs)
File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/modin/pandas/base.py", line 2567, in sample
axis = self._get_axis_number(axis)
File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/pandas/core/generic.py", line 552, in _get_axis_number
raise ValueError(f"No axis named {axis} for object type {cls.__name__}")
ValueError: No axis named None for object type DataFrame
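The traceback shows modin/pandas/base.py's sample calling self._get_axis_number, which resolves to pandas/core/generic.py. That means self was a pandas DataFrame at that point, not a Modin one. The pure-Python sketch below imitates that situation. Inner, Wrapper, and their methods are hypothetical stand-ins rather than Modin or pandas code, and the assumption that the wrapper's own helper tolerates axis=None is made only for illustration.

```python
class Inner:
    """Hypothetical stand-in for the pandas DataFrame held in a partition."""

    def __init__(self, rows):
        self.rows = rows

    def _get_axis_number(self, axis):
        # Like the pandas helper in the traceback: there is no entry for None.
        mapping = {0: 0, "index": 0, 1: 1, "columns": 1}
        try:
            return mapping[axis]
        except KeyError:
            raise ValueError(
                f"No axis named {axis} for object type {type(self).__name__}"
            )

    def sample(self, n=1, axis=None):
        # The inner class resolves a missing axis itself.
        axis = 0 if axis is None else self._get_axis_number(axis)
        return Inner(self.rows[:n])


class Wrapper:
    """Hypothetical stand-in for the Modin DataFrame wrapping the inner object."""

    def __init__(self, rows):
        self._inner = Inner(rows)

    def _get_axis_number(self, axis):
        # Assumption for illustration: the wrapper's helper accepts None.
        return 0 if axis is None else self._inner._get_axis_number(axis)

    def sample(self, n=1, axis=None):
        # Written on the assumption that self is a Wrapper.
        axis = self._get_axis_number(axis)
        return Wrapper(self._inner.rows[:n])

    def apply(self, func, **kwargs):
        # The callable runs on the inner object, not on the wrapper.
        return func(self._inner, **kwargs)


w = Wrapper([1, 2, 3])

# Passing the inner class's function works.
print(w.apply(Inner.sample, n=1).rows)

# Passing the wrapper's function fails: self is an Inner, so the lookup of
# _get_axis_number lands on Inner's helper, which rejects None.
try:
    w.apply(Wrapper.sample, n=1)
except ValueError as err:
    print(err)
```

In this sketch the failing call ends in the same kind of message as the report ("No axis named None for object type ..."), because the wrapper's method body runs against an object of the inner type.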
@pyrito this looks like a distinct issue from #4165, which looks like a duplicate of #3435: in those issues, the bug is that the function in the apply can’t access data from other column partitions. Here, the problem is, as you said, that apply gets applied to the inner pandas dataframe rather than to a Modin dataframe. Given that Modin is meant to be a drop-in replacement for pandas, I think it’s reasonable for users to expect apply functions like modin.pandas.DataFrame.sample to work as well as pandas.DataFrame.sample.

Maybe for now we can find a way to replace functions like modin.pandas.DataFrame.sample with the pandas equivalents. I don’t know a good way to do this, though. @modin-project/modin-core @modin-project/modin-contributors is that reasonable?

Note that this problem isn’t limited to DataFrameGroupBy.apply. For example, we can apply pandas.Series.sum on each column of a pandas dataframe, but we can’t apply the Modin Series sum on each column of a Modin dataframe. The following script works at pandas 1.4.3 but fails at Modin 05933a5f27fb96f5a7ff6025ae2573d033a31b11 if I replace pandas as pd with modin.pandas as pd:

So I wouldn’t say it’s “defaulting to pandas” in this case - the functions need to be from pandas in order to work, but will still be parallel since they’ll be applied to partitions. In fact, this is how Modin implements many functions, e.g. count or sum - we map the pandas function across the partitions.
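As a rough illustration of the point about taking the callables from pandas, here is a minimal sketch run against plain pandas. It is not the script from the comment above. The frames are made up, and the suggestion in the code comments about how this would carry over to Modin follows the reasoning in this thread; it has not been verified against Modin 0.15.2.

```python
import pandas  # plain pandas: the source of the callables passed to apply

# With Modin, the thread suggests building the frames from modin.pandas
# while still taking the callables below from plain pandas (untested here).
df = pandas.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Apply the pandas Series.sum function to each column.
column_sums = df.apply(pandas.Series.sum)
print(column_sums)

# The groupby case from the report, with the callable taken from pandas.
x = pandas.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3], "c": [1, 2, 3]})
x = x.set_index(["a"])
sampled = x.groupby("a", group_keys=False).apply(pandas.DataFrame.sample, n=1)
print(sampled)
```

Each group in the second frame holds a single row, so sampling one row per group returns all three rows.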