BUG: Can't do binary operations between dataframes with virtual partitions
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): macOS Monterey 12.4 (Apple Silicon)
- Modin version (modin.__version__): latest master (cc3bdb)
- Python version: 3.9.13
- Code we can use to reproduce:
import ray
ray.init()
import modin.pandas as pd
s1 = 13
s2 = 13
df1 = pd.concat([pd.DataFrame([i]) for i in range(s1)])
df2 = pd.concat([pd.DataFrame([i]) for i in range(s2)])
print(df1 + df2)
If either the s1 or s2 parameter is 12 or smaller, this error doesn’t occur; the error also doesn’t occur when s1=14 and s2=12 (I have not tried binary searching more thoroughly). The error also doesn’t occur when the dataframes are constructed from a single list (such as df1 = pd.DataFrame([i for i in range(13)])).
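The size dependence described above could be mapped more systematically with a small sweep. The following is only a sketch: `sweep_sizes` and `modin_add` are hypothetical helper names, not part of Modin, and `modin_add` simply wraps the reproducer from this report (it imports Modin lazily, so the sweep helper itself has no Modin dependency).

```python
from itertools import product


def sweep_sizes(add_frames, sizes, error_type=AttributeError):
    """Return every (s1, s2) pair for which add_frames(s1, s2) raises error_type."""
    sizes = list(sizes)
    failing = []
    for s1, s2 in product(sizes, repeat=2):
        try:
            add_frames(s1, s2)
        except error_type:
            failing.append((s1, s2))
    return failing


def modin_add(s1, s2):
    """Reproducer from this report; requires Modin and Ray to be installed."""
    import modin.pandas as pd  # lazy import: only needed when this function is called

    df1 = pd.concat([pd.DataFrame([i]) for i in range(s1)])
    df2 = pd.concat([pd.DataFrame([i]) for i in range(s2)])
    return df1 + df2
```

Calling `sweep_sizes(modin_add, range(10, 17))` on an affected build would list the size combinations that raise; which pairs fail beyond those reported above has not been checked here.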
Describe the problem
On Ray, for dataframes with a certain number of partitions, attempting to add them together seems to cause something in the Modin codebase to try to treat a logical (column) partition as a physical (block) partition by accessing its _data field. In the given code, df1._query_compiler._modin_frame._partitions.shape is (7, 1).
This may be related to an existing issue since it may be an issue with virtual partition construction, but I’m unsure what the precise root cause is.
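The suspected failure mode can be illustrated with a toy model. This is a sketch only: `BlockPartition`, `ColumnPartition`, `list_of_blocks`, and `add_partitions` are hypothetical stand-ins and not Modin's actual classes or attributes. It shows how code that assumes every partition exposes `_data` breaks once a wrapper-style partition appears in the grid.

```python
class BlockPartition:
    """Stand-in for a physical block partition: stores its payload in `_data`."""

    def __init__(self, data):
        self._data = data


class ColumnPartition:
    """Stand-in for a virtual (column) partition: wraps blocks and has no `_data`."""

    def __init__(self, blocks):
        self.list_of_blocks = list(blocks)


def add_partitions(left, right):
    """Mimic code that assumes both operands are physical blocks."""
    return left._data + right._data


block = BlockPartition(1)
virtual = ColumnPartition([BlockPartition(2)])

# Two physical blocks: the attribute access succeeds.
print(add_partitions(block, block))

# A virtual partition on the right: the same access raises AttributeError.
try:
    add_partitions(block, virtual)
except AttributeError as err:
    print(f"AttributeError: {err}")
```

In this model the error only appears when a wrapper partition reaches `add_partitions`, which would be consistent with the report that dataframes built from a single list are unaffected, though the actual root cause in Modin is not confirmed here.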
Source code / logs
Stack trace
2022-07-20 14:53:39,547 INFO services.py:1456 -- View the Ray dashboard at http://127.0.0.1:8265
UserWarning: When using a pre-initialized Ray cluster, please ensure that the runtime env sets environment variable __MODIN_AUTOIMPORT_PANDAS__ to 1
UserWarning: Distributing <class 'list'> object. This may take some time.
Traceback (most recent call last):
  File "/Users/jhshi/code/modin/repros/new.py", line 8, in <module>
    print(df1 + df2)
  File "/Users/jhshi/code/modin/modin/logging/logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "/Users/jhshi/code/modin/modin/pandas/dataframe.py", line 536, in add
    return self._binary_op(
  File "/Users/jhshi/code/modin/modin/logging/logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "/Users/jhshi/code/modin/modin/pandas/base.py", line 397, in _binary_op
    new_query_compiler = getattr(self._query_compiler, op)(other, **kwargs)
  File "/Users/jhshi/code/modin/modin/logging/logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "/Users/jhshi/code/modin/modin/core/dataframe/algebra/binary.py", line 92, in caller
    query_compiler._modin_frame.binary_op(
  File "/Users/jhshi/code/modin/modin/logging/logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "/Users/jhshi/code/modin/modin/core/dataframe/pandas/dataframe/dataframe.py", line 115, in run_f_on_minimally_updated_metadata
    result = f(self, *args, **kwargs)
  File "/Users/jhshi/code/modin/modin/core/dataframe/pandas/dataframe/dataframe.py", line 2531, in binary_op
    else self._partition_mgr_cls.binary_operation(
  File "/Users/jhshi/code/modin/modin/core/execution/ray/implementations/pandas_on_ray/partitioning/partition_manager.py", line 55, in magic
    result_parts = f(*args, **kwargs)
  File "/Users/jhshi/code/modin/modin/core/execution/ray/implementations/pandas_on_ray/partitioning/partition_manager.py", line 413, in binary_operation
    return super(PandasOnRayDataframePartitionManager, cls).binary_operation(
  File "/Users/jhshi/code/modin/modin/core/dataframe/pandas/partitioning/partition_manager.py", line 1290, in binary_operation
    [
  File "/Users/jhshi/code/modin/modin/core/dataframe/pandas/partitioning/partition_manager.py", line 1291, in <listcomp>
    [
  File "/Users/jhshi/code/modin/modin/core/dataframe/pandas/partitioning/partition_manager.py", line 1294, in <listcomp>
    right[row_idx][col_idx]._data,
AttributeError: 'PandasOnRayDataframeColumnPartition' object has no attribute '_data'
Issue Analytics
- Created a year ago
- Comments: 9 (9 by maintainers)
I’m marking this as P0 because I think it’s a significant bug, and it’s a regression that virtual partitioning introduced near the beginning of 2022.
Looks like @prutskov left a PR open before he left. We should try to get the fix merged in if possible.