BUG: Can't do binary operations between dataframes with virtual partitions

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): macOS Monterey 12.4 (Apple Silicon)
  • Modin version (modin.__version__): latest master (cc3bdb)
  • Python version: 3.9.13
  • Code we can use to reproduce:
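import ray
ray.init()  # start a local Ray instance before importing Modin
import modin.pandas as pd
s1 = 13  # number of single-row frames concatenated into df1
s2 = 13  # number of single-row frames concatenated into df2
df1 = pd.concat([pd.DataFrame([i]) for i in range(s1)])
df2 = pd.concat([pd.DataFrame([i]) for i in range(s2)])
print(df1 + df2)  # raises the AttributeError shown in the stack trace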

The error does not occur if either s1 or s2 is 12 or smaller (for example, s1=14 and s2=12 works); I have not searched the parameter space more thoroughly. It also does not occur when the dataframes are constructed from a single list (such as df1 = pd.DataFrame([i for i in range(13)])).
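
To map out which size pairs fail, the reproducer can be wrapped in a small sweep. The following is a minimal sketch, not something from the original report: sweep_sizes, modin_case, and stand_in_case are hypothetical names, and stand_in_case is an invented failure rule used only so the harness runs without Modin or Ray installed. It says nothing about which pairs actually fail in Modin.

```python
from itertools import product


def sweep_sizes(build_and_add, sizes):
    """Call build_and_add(s1, s2) for every size pair and record the outcome.

    Returns {(s1, s2): None on success, or the exception class name on failure}.
    """
    results = {}
    for s1, s2 in product(sizes, repeat=2):
        try:
            build_and_add(s1, s2)
            results[(s1, s2)] = None
        except Exception as exc:  # record the failure and keep sweeping
            results[(s1, s2)] = type(exc).__name__
    return results


def modin_case(s1, s2):
    """The reproducer from the report, parameterized. Requires modin and ray."""
    import modin.pandas as pd  # imported lazily so the harness has no hard dependency

    df1 = pd.concat([pd.DataFrame([i]) for i in range(s1)])
    df2 = pd.concat([pd.DataFrame([i]) for i in range(s2)])
    return df1 + df2


def stand_in_case(s1, s2):
    """Hypothetical stand-in with an invented rule, used only to exercise the harness."""
    if s1 > 12 and s2 > 12:
        raise AttributeError("simulated failure")
    return s1 + s2


demo = sweep_sizes(stand_in_case, range(11, 15))
failing = sorted(pair for pair, err in demo.items() if err is not None)
print(failing)
```

Passing modin_case instead of stand_in_case would run the real reproducer for each pair, assuming Modin and Ray are installed and Ray has been initialized.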

Describe the problem

On Ray, adding two dataframes that have a certain number of partitions appears to make something in the Modin codebase treat a logical (column) partition as a physical (block) partition by accessing its _data field. In the given code, df1._query_compiler._modin_frame._partitions.shape is (7, 1).

This may be related to an existing issue, since the cause may lie in virtual partition construction, but I’m unsure what the precise root cause is.
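
The failure mode described above can be illustrated without Modin. This is a toy model under stated assumptions: BlockPartition, ColumnPartition, as_block, add_naive, and add_safe are hypothetical stand-ins, not Modin classes or functions, and collapsing the virtual partition before the operation is just one plausible way to avoid the error. It is not necessarily how Modin resolved it.

```python
class BlockPartition:
    """Hypothetical physical partition: holds its values in `_data`."""

    def __init__(self, data):
        self._data = list(data)


class ColumnPartition:
    """Hypothetical virtual partition: wraps several blocks and has no `_data`."""

    def __init__(self, blocks):
        self.blocks = list(blocks)


def add_naive(left, right):
    """Element-wise add that assumes both operands are block partitions."""
    return BlockPartition(a + b for a, b in zip(left._data, right._data))


def as_block(part):
    """Collapse a virtual partition into a single block; pass blocks through."""
    if isinstance(part, ColumnPartition):
        return BlockPartition(v for block in part.blocks for v in block._data)
    return part


def add_safe(left, right):
    """Normalize both operands to blocks before touching `_data`."""
    return add_naive(as_block(left), as_block(right))


left = BlockPartition([1, 2])
right = ColumnPartition([BlockPartition([10]), BlockPartition([20])])

try:
    add_naive(left, right)
    naive_error = None
except AttributeError as exc:
    naive_error = str(exc)  # same kind of error as in the report

safe_result = add_safe(left, right)._data
print(naive_error)
print(safe_result)
```

The naive version fails for the same reason as the report: the code path assumes every cell exposes _data, which only holds for physical blocks.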

Source code / logs

Stack trace
2022-07-20 14:53:39,547	INFO services.py:1456 -- View the Ray dashboard at http://127.0.0.1:8265
UserWarning: When using a pre-initialized Ray cluster, please ensure that the runtime env sets environment variable __MODIN_AUTOIMPORT_PANDAS__ to 1
UserWarning: Distributing <class 'list'> object. This may take some time.
Traceback (most recent call last):
  File "/Users/jhshi/code/modin/repros/new.py", line 8, in <module>
    print(df1 + df2)
  File "/Users/jhshi/code/modin/modin/logging/logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "/Users/jhshi/code/modin/modin/pandas/dataframe.py", line 536, in add
    return self._binary_op(
  File "/Users/jhshi/code/modin/modin/logging/logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "/Users/jhshi/code/modin/modin/pandas/base.py", line 397, in _binary_op
    new_query_compiler = getattr(self._query_compiler, op)(other, **kwargs)
  File "/Users/jhshi/code/modin/modin/logging/logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "/Users/jhshi/code/modin/modin/core/dataframe/algebra/binary.py", line 92, in caller
    query_compiler._modin_frame.binary_op(
  File "/Users/jhshi/code/modin/modin/logging/logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "/Users/jhshi/code/modin/modin/core/dataframe/pandas/dataframe/dataframe.py", line 115, in run_f_on_minimally_updated_metadata
    result = f(self, *args, **kwargs)
  File "/Users/jhshi/code/modin/modin/core/dataframe/pandas/dataframe/dataframe.py", line 2531, in binary_op
    else self._partition_mgr_cls.binary_operation(
  File "/Users/jhshi/code/modin/modin/core/execution/ray/implementations/pandas_on_ray/partitioning/partition_manager.py", line 55, in magic
    result_parts = f(*args, **kwargs)
  File "/Users/jhshi/code/modin/modin/core/execution/ray/implementations/pandas_on_ray/partitioning/partition_manager.py", line 413, in binary_operation
    return super(PandasOnRayDataframePartitionManager, cls).binary_operation(
  File "/Users/jhshi/code/modin/modin/core/dataframe/pandas/partitioning/partition_manager.py", line 1290, in binary_operation
    [
  File "/Users/jhshi/code/modin/modin/core/dataframe/pandas/partitioning/partition_manager.py", line 1291, in <listcomp>
    [
  File "/Users/jhshi/code/modin/modin/core/dataframe/pandas/partitioning/partition_manager.py", line 1294, in <listcomp>
    right[row_idx][col_idx]._data,
AttributeError: 'PandasOnRayDataframeColumnPartition' object has no attribute '_data'
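
The last frames of the trace index a 2D grid of partitions by row_idx and col_idx and read _data from each cell. A small diagnostic along those lines can show which cells would trip that access. This is a sketch only: summarize_partition_types, Block, and Virtual are hypothetical names, and the grid below is made-up data rather than output from Modin.

```python
from collections import Counter


def summarize_partition_types(grid):
    """Count partition class names in a 2D grid and list cells lacking `_data`."""
    counts = Counter()
    missing = []
    for row_idx, row in enumerate(grid):
        for col_idx, part in enumerate(row):
            counts[type(part).__name__] += 1
            if not hasattr(part, "_data"):
                missing.append((row_idx, col_idx))
    return counts, missing


class Block:
    """Hypothetical stand-in for a physical block partition."""

    def __init__(self, data):
        self._data = data


class Virtual:
    """Hypothetical stand-in for a virtual (column) partition."""

    def __init__(self, blocks):
        self.blocks = blocks


# Made-up 3x1 grid with one virtual partition in the middle row.
grid = [[Block([0])], [Virtual([Block([1]), Block([2])])], [Block([3])]]
counts, missing = summarize_partition_types(grid)
print(dict(counts), missing)
```

With Modin installed, the same function could presumably be pointed at df1._query_compiler._modin_frame._partitions, the private attribute mentioned in the report; that usage is an assumption and relies on internals that may change between versions.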

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 9 (9 by maintainers)

Top GitHub Comments

mvashishtha commented, Sep 22, 2022 (1 reaction)

I’m marking this as P0 because I think it’s a significant bug, and it’s a regression that virtual partitioning introduced near the beginning of 2022.

pyrito commented, Aug 31, 2022 (0 reactions)

Looks like @prutskov left a PR open before he left. We should try to get the fix merged in if possible.

Related release note (Releases · modin-project/modin - GitHub): FIX-#4691: Fix binary operations between virtual partitions (#5049)