BUG: Lazily-evaluated DataFrame not computing Index properly

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): macOS 12.2.1
  • Modin version (modin.__version__): 0.15.3
  • Python version: 3.9.12
  • Code we can use to reproduce:
import modin.pandas as pd
import numpy as np
import decimal

df = pd.DataFrame(np.random.uniform(0.0,30.0,size=(60000,13))).add_prefix("col")
df1 = df[df['col0'] < 6.0].copy()

# This fails
df1['col0'] = df1['col0'].apply(lambda x: decimal.Decimal(str(x)))

Describe the problem

get_indices returns an empty list instead of an empty pandas.Index, which breaks later code that expects an Index (in the traceback below, a call to .equals).
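The failure mode can be reproduced with plain pandas, independent of Modin. This is a minimal sketch, assuming the comparison from the traceback below (indexes[0].equals(index)) receives a Python list where it expects a pandas.Index. The helper name all_indices_equal is made up for illustration and is not a Modin function.

```python
import pandas as pd


def all_indices_equal(indexes):
    # Hypothetical helper: same comparison as the line shown in the traceback.
    return all(indexes[0].equals(index) for index in indexes[1:])


# With real Index objects, even empty ones, the comparison works.
ok = all_indices_equal([pd.Index([]), pd.Index([])])

# With an empty list in the first position, .equals does not exist.
try:
    all_indices_equal([[], pd.Index([])])
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(ok)
print(error_message)
```

An empty pandas.Index still supports the whole Index interface, which is why returning one instead of a bare list would let the comparison go through.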

Source code / logs

Error logs
AttributeError                            Traceback (most recent call last)
Input In [6], in <cell line: 1>()
----> 1 df1['col0'] = df1['col0'].apply(lambda x: decimal.Decimal(str(x)))

File ~/Documents/modin/modin/logging/logger_decorator.py:128, in enable_logging.<locals>.decorator.<locals>.run_and_log(*args, **kwargs)
    113 """
    114 Compute function with logging if Modin logging is enabled.
    115 
   (...)
    125 Any
    126 """
    127 if LogMode.get() == "disable":
--> 128     return obj(*args, **kwargs)
    130 logger = get_logger()
    131 logger_level = getattr(logger, log_level)

File ~/Documents/modin/modin/pandas/dataframe.py:2517, in DataFrame.__setitem__(self, key, value)
   2515 if isinstance(value, Series):
   2516     value = value._query_compiler
-> 2517 self._update_inplace(self._query_compiler.setitem(0, key, value))

File ~/Documents/modin/modin/logging/logger_decorator.py:128, in enable_logging.<locals>.decorator.<locals>.run_and_log(*args, **kwargs)
    113 """
    114 Compute function with logging if Modin logging is enabled.
    115 
   (...)
    125 Any
    126 """
    127 if LogMode.get() == "disable":
--> 128     return obj(*args, **kwargs)
    130 logger = get_logger()
    131 logger_level = getattr(logger, log_level)

File ~/Documents/modin/modin/core/storage_formats/pandas/query_compiler.py:2234, in PandasQueryCompiler.setitem(self, axis, key, value)
   2233 def setitem(self, axis, key, value):
-> 2234     return self._setitem(axis=axis, key=key, value=value, how=None)

File ~/Documents/modin/modin/logging/logger_decorator.py:128, in enable_logging.<locals>.decorator.<locals>.run_and_log(*args, **kwargs)
    113 """
    114 Compute function with logging if Modin logging is enabled.
    115 
   (...)
    125 Any
    126 """
    127 if LogMode.get() == "disable":
--> 128     return obj(*args, **kwargs)
    130 logger = get_logger()
    131 logger_level = getattr(logger, log_level)

File ~/Documents/modin/modin/core/storage_formats/pandas/query_compiler.py:2296, in PandasQueryCompiler._setitem(self, axis, key, value, how)
   2294         value = value.transpose()
   2295     idx = self.get_axis(axis ^ 1).get_indexer_for([key])[0]
-> 2296     return self.insert_item(axis ^ 1, idx, value, how, replace=True)
   2298 # TODO: rework by passing list-like values to `apply_select_indices`
   2299 # as an item to distribute
   2300 if is_list_like(value):

File ~/Documents/modin/modin/logging/logger_decorator.py:128, in enable_logging.<locals>.decorator.<locals>.run_and_log(*args, **kwargs)
    113 """
    114 Compute function with logging if Modin logging is enabled.
    115 
   (...)
    125 Any
    126 """
    127 if LogMode.get() == "disable":
--> 128     return obj(*args, **kwargs)
    130 logger = get_logger()
    131 logger_level = getattr(logger, log_level)

File ~/Documents/modin/modin/core/storage_formats/base/query_compiler.py:3122, in BaseQueryCompiler.insert_item(self, axis, loc, value, how, replace)
   3120     second_mask_loc = loc + 1 if replace else loc
   3121     second_mask = mask(list(range(second_mask_loc, len(self.get_axis(axis)))))
-> 3122     return first_mask.concat(axis, [value, second_mask], join=how, sort=False)
   3123 else:
   3124     return self.concat(axis, [value], join=how, sort=False)

File ~/Documents/modin/modin/logging/logger_decorator.py:128, in enable_logging.<locals>.decorator.<locals>.run_and_log(*args, **kwargs)
    113 """
    114 Compute function with logging if Modin logging is enabled.
    115 
   (...)
    125 Any
    126 """
    127 if LogMode.get() == "disable":
--> 128     return obj(*args, **kwargs)
    130 logger = get_logger()
    131 logger_level = getattr(logger, log_level)

File ~/Documents/modin/modin/core/storage_formats/pandas/query_compiler.py:351, in PandasQueryCompiler.concat(self, axis, other, **kwargs)
    349 ignore_index = kwargs.get("ignore_index", False)
    350 other_modin_frame = [o._modin_frame for o in other]
--> 351 new_modin_frame = self._modin_frame.concat(axis, other_modin_frame, join, sort)
    352 result = self.__constructor__(new_modin_frame)
    353 if ignore_index:

File ~/Documents/modin/modin/logging/logger_decorator.py:128, in enable_logging.<locals>.decorator.<locals>.run_and_log(*args, **kwargs)
    113 """
    114 Compute function with logging if Modin logging is enabled.
    115 
   (...)
    125 Any
    126 """
    127 if LogMode.get() == "disable":
--> 128     return obj(*args, **kwargs)
    130 logger = get_logger()
    131 logger_level = getattr(logger, log_level)

File ~/Documents/modin/modin/core/dataframe/pandas/dataframe/dataframe.py:124, in lazy_metadata_decorator.<locals>.decorator.<locals>.run_f_on_minimally_updated_metadata(self, *args, **kwargs)
    122     elif apply_axis == "rows":
    123         obj._propagate_index_objs(axis=0)
--> 124 result = f(self, *args, **kwargs)
    125 if apply_axis is None and not transpose:
    126     result._deferred_index = self._deferred_index

File ~/Documents/modin/modin/core/dataframe/pandas/dataframe/dataframe.py:2786, in PandasDataframe.concat(self, axis, others, how, sort)
   2779     new_widths = _compute_new_widths()
   2780 else:
   2781     (
   2782         left_parts,
   2783         right_parts,
   2784         joined_index,
   2785         partition_sizes_along_axis,
-> 2786     ) = self._copartition(
   2787         axis.value ^ 1, others, how, sort, force_repartition=False
   2788     )
   2789     if axis == Axis.COL_WISE:
   2790         new_lengths = partition_sizes_along_axis

File ~/Documents/modin/modin/logging/logger_decorator.py:128, in enable_logging.<locals>.decorator.<locals>.run_and_log(*args, **kwargs)
    113 """
    114 Compute function with logging if Modin logging is enabled.
    115 
   (...)
    125 Any
    126 """
    127 if LogMode.get() == "disable":
--> 128     return obj(*args, **kwargs)
    130 logger = get_logger()
    131 logger_level = getattr(logger, log_level)

File ~/Documents/modin/modin/core/dataframe/pandas/dataframe/dataframe.py:2571, in PandasDataframe._copartition(self, axis, other, how, sort, force_repartition)
   2569 self_index = self.axes[axis]
   2570 others_index = [o.axes[axis] for o in other]
-> 2571 joined_index, make_reindexer = self._join_index_objects(
   2572     axis, [self_index] + others_index, how, sort
   2573 )
   2575 frames = [self] + other
   2576 non_empty_frames_idx = [
   2577     i for i, o in enumerate(frames) if o._partitions.size != 0
   2578 ]

File ~/Documents/modin/modin/logging/logger_decorator.py:128, in enable_logging.<locals>.decorator.<locals>.run_and_log(*args, **kwargs)
    113 """
    114 Compute function with logging if Modin logging is enabled.
    115 
   (...)
    125 Any
    126 """
    127 if LogMode.get() == "disable":
--> 128     return obj(*args, **kwargs)
    130 logger = get_logger()
    131 logger_level = getattr(logger, log_level)

File ~/Documents/modin/modin/core/dataframe/pandas/dataframe/dataframe.py:1445, in PandasDataframe._join_index_objects(axis, indexes, how, sort)
   1442         return left_index.join(right_index, how=how, sort=sort)
   1444 # define condition for joining indexes
-> 1445 all_indices_equal = all(indexes[0].equals(index) for index in indexes[1:])
   1446 do_join_index = how is not None and not all_indices_equal
   1448 # define condition for joining indexes with getting indexers

File ~/Documents/modin/modin/core/dataframe/pandas/dataframe/dataframe.py:1445, in <genexpr>(.0)
   1442         return left_index.join(right_index, how=how, sort=sort)
   1444 # define condition for joining indexes
-> 1445 all_indices_equal = all(indexes[0].equals(index) for index in indexes[1:])
   1446 do_join_index = how is not None and not all_indices_equal
   1448 # define condition for joining indexes with getting indexers

AttributeError: 'list' object has no attribute 'equals'

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

anmyachev commented, Sep 9, 2022

That’s exactly right! Very good explanation.

If we want to calculate indexes lazily, we should not filter out empty dataframes, because the indexes can still be calculated from them later. This is what I tried to do in https://github.com/modin-project/modin/pull/4951.

Example:

>>> df
   col1  col2
0     1   1.3
1     2   2.5
2     3   2.9
3     4   1.6
>>> df.iloc[:, []]
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]
>>> df.iloc[:, []].index
RangeIndex(start=0, stop=4, step=1)
>>> df.iloc[[], :]
Empty DataFrame
Columns: [col1, col2]
Index: []
>>> df.iloc[[], :].columns
Index(['col1', 'col2'], dtype='object')
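
The same point as a runnable script, using plain pandas rather than Modin. It is only a sketch of the behavior shown in the session above: a frame with no columns still carries its row labels, and a frame with no rows still carries its column labels. The variable names are chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3, 4], "col2": [1.3, 2.5, 2.9, 1.6]})

# Select zero columns: the frame is empty, but the row labels survive.
no_columns = df.iloc[:, []]
row_labels = no_columns.index

# Select zero rows: the frame is empty, but the column labels survive.
no_rows = df.iloc[[], :]
column_labels = no_rows.columns

print(list(row_labels))
print(list(column_labels))
```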
pyrito commented, Sep 9, 2022

Let’s take a look at a smaller reproducer:

import modin.pandas as pd
import numpy as np

df = pd.DataFrame({"col0": [0,1]})
df1 = df[df['col0'] < 6.0].copy()

# This returns an empty dataframe
df1['col0'] = df1['col0'].apply(lambda x: x+1)

In that last line, we are doing a setitem to assign a new value to the column. Following the backtrace to where things go wrong, we see this sequence of calls:

  1. insert_item in modin/core/storage_formats/base/query_compiler.py creates first_mask around line 3119 by calling a function named mask.
  2. mask calls getitem_column_array, which is implemented in the pandas query compiler.
  3. getitem_column_array calls take_2d_labels_or_positional, which is implemented in PandasDataframe.
  4. take_2d_labels_or_positional calls _take_2d_positional.
  5. This is where things get a little hazy to me. _take_2d_positional creates a new Modin frame from a subset of the partitions. Here the new_partitions list ends up empty, and new_index is set to None because it is taken from the index cache (which is None after the copy). With no partitions and no cached index, the result has a fully empty Index, which is why we get back an empty DataFrame.
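
To make step 5 concrete, here is a toy model of the situation. It is not Modin's code: ToyFrame, column_blocks, index_cache and take_no_columns are all hypothetical names, and the sketch only assumes what is described above, namely that the index is computed lazily from the partitions when the cache is None.

```python
import pandas as pd


class ToyFrame:
    """Hypothetical stand-in for a partitioned frame with a lazy index."""

    def __init__(self, column_blocks, index_cache=None):
        self.column_blocks = column_blocks  # list of pandas DataFrames
        self.index_cache = index_cache      # None means "not computed yet"

    @property
    def index(self):
        if self.index_cache is None:
            if self.column_blocks:
                # Recompute the row labels from any remaining block.
                self.index_cache = self.column_blocks[0].index
            else:
                # Nothing left to derive the labels from.
                self.index_cache = pd.Index([])
        return self.index_cache

    def take_no_columns(self, keep_empty_blocks):
        # Select zero columns, as a mask over an empty range would.
        if keep_empty_blocks:
            blocks = [block.iloc[:, []] for block in self.column_blocks]
        else:
            blocks = []
        # The cache is None, as it is after the copy in the reproducer.
        return ToyFrame(blocks, index_cache=None)


source = ToyFrame([pd.DataFrame({"col0": [0, 1]})])
lost = source.take_no_columns(keep_empty_blocks=False).index
kept = source.take_no_columns(keep_empty_blocks=True).index

print(list(lost))
print(list(kept))
```

In this toy model, dropping the empty blocks loses the row labels, while keeping them lets the lazy index be recomputed.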

Notes: I experimented with how new_partitions is created, and the empty list appears to be intentional. My guess is that https://github.com/modin-project/modin/pull/4911 stopped the index from being recomputed, and that recomputation would have avoided this situation in the first place. So perhaps the root cause is related to that change?
