BUG: Can't do binary operations between dataframes with virtual partitions
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): macOS Monterey 12.4 (Apple Silicon)
- Modin version (modin.__version__): latest master (cc3bdb)
- Python version: 3.9.13
- Code we can use to reproduce:
import ray
ray.init()
import modin.pandas as pd
s1 = 13
s2 = 13
df1 = pd.concat([pd.DataFrame([i]) for i in range(s1)])
df2 = pd.concat([pd.DataFrame([i]) for i in range(s2)])
print(df1 + df2)
If either s1 or s2 parameter is 12 or smaller, this error doesn’t occur; the error also doesn’t occur when s1=14 and s2=12 (I have not tried binary searching more thoroughly). The error also doesn’t occur when the dataframes are constructed from a single list (such as df1 = pd.DataFrame([i for i in range(13)])).
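Since the size combinations above were only spot-checked, a small sweep helper could map out which (s1, s2) pairs fail. The sketch below is an assumption-laden illustration: `failing_pairs` and `fake_attempt` are hypothetical names, and `fake_attempt` is a stand-in with an arbitrary threshold, not the observed Modin behavior.

```python
from itertools import product


def failing_pairs(attempt, sizes):
    """Return the (s1, s2) pairs for which attempt(s1, s2) raises AttributeError."""
    failures = []
    for s1, s2 in product(sizes, repeat=2):
        try:
            attempt(s1, s2)
        except AttributeError:
            failures.append((s1, s2))
    return failures


def fake_attempt(s1, s2):
    # Hypothetical stand-in for the real reproduction. The threshold is made
    # up for illustration and does not describe what Modin actually does.
    if s1 >= 13 and s2 >= 13:
        raise AttributeError("stand-in failure")


print(failing_pairs(fake_attempt, range(11, 15)))
```

To run the sweep against Modin, `fake_attempt` would be replaced by a function that builds df1 and df2 with pd.concat as in the reproduction and evaluates df1 + df2.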
Describe the problem
On Ray, for dataframes with a certain number of partitions, adding them together appears to make the partition manager's binary_operation treat a logical (column) partition as a physical (block) partition by accessing its _data field, which PandasOnRayDataframeColumnPartition does not have (see the stack trace below). In the given code, df1._query_compiler._modin_frame._partitions.shape is (7, 1).
This may be related to an existing issue since it may be an issue with virtual partition construction, but I’m unsure what the precise root cause is.
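The failure mode described above can be illustrated without Modin or Ray. The following is a minimal sketch under stated assumptions: `BlockPartition`, `VirtualColumnPartition`, `naive_binary_op`, and `add_lists` are hypothetical stand-ins, not Modin's actual classes or code. It only shows how reading `._data` unconditionally breaks once one grid holds a virtual partition.

```python
class BlockPartition:
    """Hypothetical stand-in for a physical (block) partition."""

    def __init__(self, data):
        self._data = data  # a single materialized block


class VirtualColumnPartition:
    """Hypothetical stand-in for a virtual partition spanning several blocks."""

    def __init__(self, blocks):
        self.list_of_blocks = blocks  # note: there is no _data attribute


def naive_binary_op(left, right, func):
    # Reads ._data on every cell of both grids, similar in spirit to the
    # access pattern at the bottom of the stack trace.
    return [
        [func(left[r][c]._data, right[r][c]._data) for c in range(len(left[r]))]
        for r in range(len(left))
    ]


def add_lists(a, b):
    # Element-wise addition of two equally sized lists.
    return [x + y for x, y in zip(a, b)]


left = [[BlockPartition([0, 1])]]
right = [[VirtualColumnPartition([BlockPartition([0]), BlockPartition([1])])]]

try:
    naive_binary_op(left, right, add_lists)
    error_message = None
except AttributeError as err:
    error_message = str(err)

print(error_message)
```

When both grids contain only block partitions, the same function works, which would be consistent with the error appearing only for some partition layouts.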
Source code / logs
Stack trace
2022-07-20 14:53:39,547 INFO services.py:1456 -- View the Ray dashboard at http://127.0.0.1:8265
UserWarning: When using a pre-initialized Ray cluster, please ensure that the runtime env sets environment variable __MODIN_AUTOIMPORT_PANDAS__ to 1
UserWarning: Distributing <class 'list'> object. This may take some time.
Traceback (most recent call last):
File "/Users/jhshi/code/modin/repros/new.py", line 8, in <module>
print(df1 + df2)
File "/Users/jhshi/code/modin/modin/logging/logger_decorator.py", line 128, in run_and_log
return obj(*args, **kwargs)
File "/Users/jhshi/code/modin/modin/pandas/dataframe.py", line 536, in add
return self._binary_op(
File "/Users/jhshi/code/modin/modin/logging/logger_decorator.py", line 128, in run_and_log
return obj(*args, **kwargs)
File "/Users/jhshi/code/modin/modin/pandas/base.py", line 397, in _binary_op
new_query_compiler = getattr(self._query_compiler, op)(other, **kwargs)
File "/Users/jhshi/code/modin/modin/logging/logger_decorator.py", line 128, in run_and_log
return obj(*args, **kwargs)
File "/Users/jhshi/code/modin/modin/core/dataframe/algebra/binary.py", line 92, in caller
query_compiler._modin_frame.binary_op(
File "/Users/jhshi/code/modin/modin/logging/logger_decorator.py", line 128, in run_and_log
return obj(*args, **kwargs)
File "/Users/jhshi/code/modin/modin/core/dataframe/pandas/dataframe/dataframe.py", line 115, in run_f_on_minimally_updated_metadata
result = f(self, *args, **kwargs)
File "/Users/jhshi/code/modin/modin/core/dataframe/pandas/dataframe/dataframe.py", line 2531, in binary_op
else self._partition_mgr_cls.binary_operation(
File "/Users/jhshi/code/modin/modin/core/execution/ray/implementations/pandas_on_ray/partitioning/partition_manager.py", line 55, in magic
result_parts = f(*args, **kwargs)
File "/Users/jhshi/code/modin/modin/core/execution/ray/implementations/pandas_on_ray/partitioning/partition_manager.py", line 413, in binary_operation
return super(PandasOnRayDataframePartitionManager, cls).binary_operation(
File "/Users/jhshi/code/modin/modin/core/dataframe/pandas/partitioning/partition_manager.py", line 1290, in binary_operation
[
File "/Users/jhshi/code/modin/modin/core/dataframe/pandas/partitioning/partition_manager.py", line 1291, in <listcomp>
[
File "/Users/jhshi/code/modin/modin/core/dataframe/pandas/partitioning/partition_manager.py", line 1294, in <listcomp>
right[row_idx][col_idx]._data,
AttributeError: 'PandasOnRayDataframeColumnPartition' object has no attribute '_data'

I’m marking this as P0 because I think it’s a significant bug, and it’s a regression that virtual partitioning introduced near the beginning of 2022.
Looks like @prutskov left a PR open before he left. We should try to get the fix merged in if possible.