BUG: Can't pass modin functions to modin functions that take a callable parameter

System information

  • OS: OS X 11.6.4
  • Modin version: 0.15.2
  • Python version: 3.9.12

Describe the problem

When I pass DataFrame.sample to apply on a grouped DataFrame (DataFrameGroupBy.apply), Modin raises an error. The same code runs without issues in pandas.

Source code / logs

>>> import modin.pandas as pd
>>> import modin.config as cfg
>>> cfg.Engine.put('Python')
>>> x = pd.DataFrame(data={"a":[1,2,3], "b":[1,2,3], "c":[1,2,3]})
UserWarning: Distributing <class 'dict'> object. This may take some time.
>>> x = x.set_index(['a'])
>>> x.groupby('a', group_keys=False).apply(pd.DataFrame.sample, n=1)
Traceback (most recent call last):
  File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/pandas/core/generic.py", line 550, in _get_axis_number
    return cls._AXIS_TO_AXIS_NUMBER[axis]
KeyError: None

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/modin/core/storage_formats/pandas/query_compiler.py", line 2805, in groupby_agg_builder
    return compute_groupby(df, drop, partition_idx)
  File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/modin/core/storage_formats/pandas/query_compiler.py", line 2754, in compute_groupby
    result = partition_agg_func(grouped_df, *agg_args, **agg_kwargs)
  File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/pandas/core/groupby/groupby.py", line 1423, in apply
    result = self._python_apply_general(f, self._selected_obj)
  File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/pandas/core/groupby/groupby.py", line 1464, in _python_apply_general
    values, mutated = self.grouper.apply(f, data, self.axis)
  File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/pandas/core/groupby/ops.py", line 761, in apply
    res = f(group)
  File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/pandas/core/groupby/groupby.py", line 1397, in f
    return func(g, *args, **kwargs)
  File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/modin/utils.py", line 521, in wrapper
    result = func(*args, **kwargs)
  File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/modin/logging/logger_metaclass.py", line 68, in log_wrap
    return method(*args, **kwargs)
  File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/modin/pandas/base.py", line 2567, in sample
    axis = self._get_axis_number(axis)
  File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/pandas/core/generic.py", line 552, in _get_axis_number
    raise ValueError(f"No axis named {axis} for object type {cls.__name__}")
ValueError: No axis named None for object type DataFrame

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/pandas/core/generic.py", line 550, in _get_axis_number
    return cls._AXIS_TO_AXIS_NUMBER[axis]
KeyError: None

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/modin/logging/logger_metaclass.py", line 68, in log_wrap
    return method(*args, **kwargs)
  File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/modin/pandas/groupby.py", line 327, in apply
    self._wrap_aggregation(
  File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/modin/logging/logger_metaclass.py", line 68, in log_wrap
    return method(*args, **kwargs)
  File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/modin/pandas/groupby.py", line 1082, in _wrap_aggregation
    query_compiler=qc_method(
  File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/modin/logging/logger_metaclass.py", line 68, in log_wrap
    return method(*args, **kwargs)
  File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/modin/core/storage_formats/pandas/query_compiler.py", line 2813, in groupby_agg
    new_modin_frame = self._modin_frame.broadcast_apply_full_axis(
  File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/modin/logging/logger_metaclass.py", line 68, in log_wrap
    return method(*args, **kwargs)
  File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/modin/core/dataframe/pandas/dataframe/dataframe.py", line 115, in run_f_on_minimally_updated_metadata
    result = f(self, *args, **kwargs)
  File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/modin/core/dataframe/pandas/dataframe/dataframe.py", line 2330, in broadcast_apply_full_axis
    new_partitions = self._partition_mgr_cls.broadcast_axis_partitions(
  File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/modin/core/dataframe/pandas/partitioning/partition_manager.py", line 463, in broadcast_axis_partitions
    [
  File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/modin/core/dataframe/pandas/partitioning/partition_manager.py", line 464, in <listcomp>
    left_partitions[i].apply(
  File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/modin/core/dataframe/pandas/partitioning/axis_partition.py", line 99, in apply
    return self._wrap_partitions(self.deploy_axis_func(*args, **kwargs))
  File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/modin/core/dataframe/pandas/partitioning/axis_partition.py", line 162, in deploy_axis_func
    result = func(dataframe, *kwargs.pop("args", ()), **kwargs)
  File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/modin/core/dataframe/pandas/dataframe/dataframe.py", line 1393, in _tree_reduce_func
    series_result = func(df, *args, **kwargs)
  File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/modin/core/storage_formats/pandas/query_compiler.py", line 2815, in <lambda>
    func=lambda df, by=None, partition_idx=None: groupby_agg_builder(
  File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/modin/core/storage_formats/pandas/query_compiler.py", line 2809, in groupby_agg_builder
    return compute_groupby(df.copy(), drop, partition_idx)
  File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/modin/core/storage_formats/pandas/query_compiler.py", line 2754, in compute_groupby
    result = partition_agg_func(grouped_df, *agg_args, **agg_kwargs)
  File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/pandas/core/groupby/groupby.py", line 1423, in apply
    result = self._python_apply_general(f, self._selected_obj)
  File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/pandas/core/groupby/groupby.py", line 1464, in _python_apply_general
    values, mutated = self.grouper.apply(f, data, self.axis)
  File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/pandas/core/groupby/ops.py", line 761, in apply
    res = f(group)
  File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/pandas/core/groupby/groupby.py", line 1397, in f
    return func(g, *args, **kwargs)
  File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/modin/utils.py", line 521, in wrapper
    result = func(*args, **kwargs)
  File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/modin/logging/logger_metaclass.py", line 68, in log_wrap
    return method(*args, **kwargs)
  File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/modin/pandas/base.py", line 2567, in sample
    axis = self._get_axis_number(axis)
  File "/usr/local/Caskroom/miniforge/base/envs/enobase3/lib/python3.9/site-packages/pandas/core/generic.py", line 552, in _get_axis_number
    raise ValueError(f"No axis named {axis} for object type {cls.__name__}")
ValueError: No axis named None for object type DataFrame
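
The last frames of the traceback suggest the mechanism: the unbound Modin `sample` is called with a plain pandas DataFrame as `self`, so `self._get_axis_number(axis)` resolves to the pandas implementation, which rejects `None`. Below is a minimal pure-Python sketch of that pattern. `InnerFrame`, `WrapperFrame`, and `apply_to_partition` are hypothetical stand-ins for illustration, not pandas or Modin APIs.

```python
class InnerFrame:
    """Hypothetical stand-in for the pandas DataFrame held inside a partition."""

    _AXES = {0: 0, "index": 0, 1: 1, "columns": 1}

    def _get_axis_number(self, axis):
        # Strict lookup: None is not a valid axis here.
        try:
            return self._AXES[axis]
        except KeyError:
            raise ValueError(f"No axis named {axis} for object type InnerFrame")

    def sample(self, axis=None):
        # The inner class resolves its own default before the lookup.
        return self._get_axis_number(0 if axis is None else axis)


class WrapperFrame:
    """Hypothetical stand-in for the wrapping (Modin-like) DataFrame."""

    def _get_axis_number(self, axis):
        # Lenient lookup: None silently means axis 0.
        return 0 if axis is None else InnerFrame._AXES[axis]

    def sample(self, axis=None):
        # Relies on the wrapper's lenient _get_axis_number.
        return self._get_axis_number(axis)


def apply_to_partition(func, partition):
    """Call func on the inner object, as a group-wise apply does per group."""
    return func(partition)


partition = InnerFrame()

# Passing the inner class's own method works.
ok = apply_to_partition(InnerFrame.sample, partition)

# Passing the wrapper's unbound method binds `self` to an InnerFrame,
# so the strict _get_axis_number runs and rejects None.
try:
    apply_to_partition(WrapperFrame.sample, partition)
    error = None
except ValueError as exc:
    error = str(exc)
```

The sketch only mirrors the shape of the failure; the real classes involve more layers, as the traceback shows.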

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 9 (9 by maintainers)

Top GitHub Comments

mvashishtha commented, Jul 12, 2022 (2 reactions)

@pyrito this looks like a distinct issue from #4165, which looks like a duplicate of #3435: in those issues, the bug is that the function in the apply can’t access data from other column partitions. Here, the problem is, as you said, that apply gets applied to the inner pandas dataframe rather than to a Modin dataframe. Given that Modin is meant to be a drop-in replacement for pandas, I think it’s reasonable for users to expect apply functions like modin.pandas.DataFrame.sample to work as well as pandas.DataFrame.sample.

Maybe for now we can find a way to replace functions like modin.pandas.DataFrame.sample with the pandas equivalents. I don’t know a good way to do this, though. @modin-project/modin-core @modin-project/modin-contributors is that reasonable?
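
One possible shape for such a replacement is sketched below. It is not taken from Modin. The helper name `to_pandas_callable` is hypothetical, and it rests on two unverified assumptions: that a Modin callable reports a `__module__` starting with "modin", and that it shares its `__name__` with the pandas method. A stand-in function is used so the sketch runs without Modin installed.

```python
import pandas


def to_pandas_callable(func, pandas_cls=pandas.DataFrame):
    """Return the same-named pandas method if func looks like a Modin one.

    Hypothetical helper. Assumes Modin callables expose a __module__ that
    starts with "modin" and a __name__ matching the pandas method.
    """
    module = getattr(func, "__module__", "") or ""
    name = getattr(func, "__name__", "")
    if module.startswith("modin") and hasattr(pandas_cls, name):
        return getattr(pandas_cls, name)
    # Anything else (lambdas, builtins, pandas functions) passes through.
    return func


# Stand-in for modin.pandas.DataFrame.sample (assumption, not the real one).
def sample(self, *args, **kwargs):
    raise ValueError("No axis named None for object type DataFrame")


sample.__module__ = "modin.pandas.base"

swapped = to_pandas_callable(sample)
unchanged = to_pandas_callable(len)
```

A name-based lookup like this would miss callables that Modin defines without a pandas counterpart, which is one reason it is only a sketch.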

Note that this problem isn’t limited to DataFrameGroupBy.apply. For example, we can apply pandas.Series.sum to each column of a pandas dataframe, but we can’t apply the Modin Series.sum to each column of a Modin dataframe. The following script works with pandas 1.4.3 but fails with Modin at commit 05933a5f27fb96f5a7ff6025ae2573d033a31b11 if I replace pandas as pd with modin.pandas as pd:

import pandas as pd
df = pd.DataFrame([1])
print(df.apply(pd.Series.sum))
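
A possible workaround, sketched here with plain pandas so that it runs anywhere, is to pass a lambda that calls the method by name on whatever object it receives instead of passing the class-level function. The assumption, not verified against Modin here, is that the same lines behave the same way after swapping the import for `modin.pandas`.

```python
import pandas as pd  # assumption: `import modin.pandas as pd` behaves the same

df = pd.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3], "c": [1, 2, 3]}).set_index("a")

# Column-wise apply: call the method by name instead of passing pd.Series.sum.
col_sums = df.apply(lambda col: col.sum())

# Group-wise apply: the same idea for the sample call in the original report.
sampled = df.groupby("a", group_keys=False).apply(lambda g: g.sample(n=1))
```

Because the lambda looks up `sum` or `sample` on the object it is handed, it does not matter whether that object is a pandas or a Modin one.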

RehanSD commented, Jul 12, 2022 (1 reaction)

So I wouldn’t say it’s “defaulting to pandas” in this case. The functions need to come from pandas in order to work, but they will still run in parallel since they’ll be applied to partitions. In fact, this is how Modin implements many functions, e.g. count or sum: we map the pandas function across the partitions.
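
To illustrate the map-across-partitions idea in the comment above, here is a rough sketch using plain pandas only. The two-chunk row split and the sequential loop are simplifying assumptions; it does not reproduce Modin's partitioning or execution engine.

```python
import pandas as pd

df = pd.DataFrame({"b": [1, None, 3, 4], "c": [1, 2, None, None]})

# Hypothetical row partitioning: two chunks of two rows each.
partitions = [df.iloc[:2], df.iloc[2:]]

# Map: run the plain pandas function on every partition independently.
partial_counts = [pd.DataFrame.count(part) for part in partitions]

# Reduce: add the per-partition counts to get the overall result.
total = pd.concat(partial_counts, axis=1).sum(axis=1)
```

Each map call receives an ordinary pandas DataFrame, which is why the function being mapped has to be the pandas one.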
