BUG: Lazily-evaluated DataFrame not computing Index properly
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): macOS 12.2.1
- Modin version (modin.__version__): 0.15.3
- Python version: 3.9.12
- Code we can use to reproduce:
import modin.pandas as pd
import numpy as np
import decimal
df = pd.DataFrame(np.random.uniform(0.0,30.0,size=(60000,13))).add_prefix("col")
df1 = df[df['col0'] < 6.0].copy()
# This fails
df1['col0'] = df1['col0'].apply(lambda x: decimal.Decimal(str(x)))
Describe the problem
get_indices returns an empty list instead of an empty pandas.Index, which leads to a bug in certain cases.
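A minimal sketch in plain pandas (not Modin's own code) of why the return type matters: an empty pandas.Index still has Index methods such as equals, while an empty list does not. The variable names are illustrative only.

```python
import pandas as pd

empty_index = pd.Index([])
empty_list = []

# An empty Index still supports Index methods such as `equals`.
index_ok = empty_index.equals(pd.Index([]))

# An empty list has no such method, so the same call raises AttributeError.
try:
    empty_list.equals(empty_index)
    list_error = None
except AttributeError as exc:
    list_error = str(exc)

print(index_ok)
print(list_error)
```

Any code path that treats the result as an Index will therefore break as soon as the list is returned.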
Source code / logs
Error logs
AttributeError Traceback (most recent call last)
Input In [6], in <cell line: 1>()
----> 1 df1['col0'] = df1['col0'].apply(lambda x: decimal.Decimal(str(x)))
File ~/Documents/modin/modin/logging/logger_decorator.py:128, in enable_logging.<locals>.decorator.<locals>.run_and_log(*args, **kwargs)
113 """
114 Compute function with logging if Modin logging is enabled.
115
(...)
125 Any
126 """
127 if LogMode.get() == "disable":
--> 128 return obj(*args, **kwargs)
130 logger = get_logger()
131 logger_level = getattr(logger, log_level)
File ~/Documents/modin/modin/pandas/dataframe.py:2517, in DataFrame.__setitem__(self, key, value)
2515 if isinstance(value, Series):
2516 value = value._query_compiler
-> 2517 self._update_inplace(self._query_compiler.setitem(0, key, value))
File ~/Documents/modin/modin/logging/logger_decorator.py:128, in enable_logging.<locals>.decorator.<locals>.run_and_log(*args, **kwargs)
113 """
114 Compute function with logging if Modin logging is enabled.
115
(...)
125 Any
126 """
127 if LogMode.get() == "disable":
--> 128 return obj(*args, **kwargs)
130 logger = get_logger()
131 logger_level = getattr(logger, log_level)
File ~/Documents/modin/modin/core/storage_formats/pandas/query_compiler.py:2234, in PandasQueryCompiler.setitem(self, axis, key, value)
2233 def setitem(self, axis, key, value):
-> 2234 return self._setitem(axis=axis, key=key, value=value, how=None)
File ~/Documents/modin/modin/logging/logger_decorator.py:128, in enable_logging.<locals>.decorator.<locals>.run_and_log(*args, **kwargs)
113 """
114 Compute function with logging if Modin logging is enabled.
115
(...)
125 Any
126 """
127 if LogMode.get() == "disable":
--> 128 return obj(*args, **kwargs)
130 logger = get_logger()
131 logger_level = getattr(logger, log_level)
File ~/Documents/modin/modin/core/storage_formats/pandas/query_compiler.py:2296, in PandasQueryCompiler._setitem(self, axis, key, value, how)
2294 value = value.transpose()
2295 idx = self.get_axis(axis ^ 1).get_indexer_for([key])[0]
-> 2296 return self.insert_item(axis ^ 1, idx, value, how, replace=True)
2298 # TODO: rework by passing list-like values to `apply_select_indices`
2299 # as an item to distribute
2300 if is_list_like(value):
File ~/Documents/modin/modin/logging/logger_decorator.py:128, in enable_logging.<locals>.decorator.<locals>.run_and_log(*args, **kwargs)
113 """
114 Compute function with logging if Modin logging is enabled.
115
(...)
125 Any
126 """
127 if LogMode.get() == "disable":
--> 128 return obj(*args, **kwargs)
130 logger = get_logger()
131 logger_level = getattr(logger, log_level)
File ~/Documents/modin/modin/core/storage_formats/base/query_compiler.py:3122, in BaseQueryCompiler.insert_item(self, axis, loc, value, how, replace)
3120 second_mask_loc = loc + 1 if replace else loc
3121 second_mask = mask(list(range(second_mask_loc, len(self.get_axis(axis)))))
-> 3122 return first_mask.concat(axis, [value, second_mask], join=how, sort=False)
3123 else:
3124 return self.concat(axis, [value], join=how, sort=False)
File ~/Documents/modin/modin/logging/logger_decorator.py:128, in enable_logging.<locals>.decorator.<locals>.run_and_log(*args, **kwargs)
113 """
114 Compute function with logging if Modin logging is enabled.
115
(...)
125 Any
126 """
127 if LogMode.get() == "disable":
--> 128 return obj(*args, **kwargs)
130 logger = get_logger()
131 logger_level = getattr(logger, log_level)
File ~/Documents/modin/modin/core/storage_formats/pandas/query_compiler.py:351, in PandasQueryCompiler.concat(self, axis, other, **kwargs)
349 ignore_index = kwargs.get("ignore_index", False)
350 other_modin_frame = [o._modin_frame for o in other]
--> 351 new_modin_frame = self._modin_frame.concat(axis, other_modin_frame, join, sort)
352 result = self.__constructor__(new_modin_frame)
353 if ignore_index:
File ~/Documents/modin/modin/logging/logger_decorator.py:128, in enable_logging.<locals>.decorator.<locals>.run_and_log(*args, **kwargs)
113 """
114 Compute function with logging if Modin logging is enabled.
115
(...)
125 Any
126 """
127 if LogMode.get() == "disable":
--> 128 return obj(*args, **kwargs)
130 logger = get_logger()
131 logger_level = getattr(logger, log_level)
File ~/Documents/modin/modin/core/dataframe/pandas/dataframe/dataframe.py:124, in lazy_metadata_decorator.<locals>.decorator.<locals>.run_f_on_minimally_updated_metadata(self, *args, **kwargs)
122 elif apply_axis == "rows":
123 obj._propagate_index_objs(axis=0)
--> 124 result = f(self, *args, **kwargs)
125 if apply_axis is None and not transpose:
126 result._deferred_index = self._deferred_index
File ~/Documents/modin/modin/core/dataframe/pandas/dataframe/dataframe.py:2786, in PandasDataframe.concat(self, axis, others, how, sort)
2779 new_widths = _compute_new_widths()
2780 else:
2781 (
2782 left_parts,
2783 right_parts,
2784 joined_index,
2785 partition_sizes_along_axis,
-> 2786 ) = self._copartition(
2787 axis.value ^ 1, others, how, sort, force_repartition=False
2788 )
2789 if axis == Axis.COL_WISE:
2790 new_lengths = partition_sizes_along_axis
File ~/Documents/modin/modin/logging/logger_decorator.py:128, in enable_logging.<locals>.decorator.<locals>.run_and_log(*args, **kwargs)
113 """
114 Compute function with logging if Modin logging is enabled.
115
(...)
125 Any
126 """
127 if LogMode.get() == "disable":
--> 128 return obj(*args, **kwargs)
130 logger = get_logger()
131 logger_level = getattr(logger, log_level)
File ~/Documents/modin/modin/core/dataframe/pandas/dataframe/dataframe.py:2571, in PandasDataframe._copartition(self, axis, other, how, sort, force_repartition)
2569 self_index = self.axes[axis]
2570 others_index = [o.axes[axis] for o in other]
-> 2571 joined_index, make_reindexer = self._join_index_objects(
2572 axis, [self_index] + others_index, how, sort
2573 )
2575 frames = [self] + other
2576 non_empty_frames_idx = [
2577 i for i, o in enumerate(frames) if o._partitions.size != 0
2578 ]
File ~/Documents/modin/modin/logging/logger_decorator.py:128, in enable_logging.<locals>.decorator.<locals>.run_and_log(*args, **kwargs)
113 """
114 Compute function with logging if Modin logging is enabled.
115
(...)
125 Any
126 """
127 if LogMode.get() == "disable":
--> 128 return obj(*args, **kwargs)
130 logger = get_logger()
131 logger_level = getattr(logger, log_level)
File ~/Documents/modin/modin/core/dataframe/pandas/dataframe/dataframe.py:1445, in PandasDataframe._join_index_objects(axis, indexes, how, sort)
1442 return left_index.join(right_index, how=how, sort=sort)
1444 # define condition for joining indexes
-> 1445 all_indices_equal = all(indexes[0].equals(index) for index in indexes[1:])
1446 do_join_index = how is not None and not all_indices_equal
1448 # define condition for joining indexes with getting indexers
File ~/Documents/modin/modin/core/dataframe/pandas/dataframe/dataframe.py:1445, in <genexpr>(.0)
1442 return left_index.join(right_index, how=how, sort=sort)
1444 # define condition for joining indexes
-> 1445 all_indices_equal = all(indexes[0].equals(index) for index in indexes[1:])
1446 do_join_index = how is not None and not all_indices_equal
1448 # define condition for joining indexes with getting indexers
AttributeError: 'list' object has no attribute 'equals'
Issue Analytics
- State:
- Created a year ago
- Comments:7 (7 by maintainers)
It’s like that! Very good explanation.
If we want to calculate indexes lazily, we should not filter out empty dataframes, from which we can then calculate indexes. This is what I tried to do in https://github.com/modin-project/modin/pull/4951.
Example:
Let’s take a look at the smaller reproducer:
So in that last line, we are doing a setitem to assign the column a new value. When we look at a backtrace for where things are going wrong we can see the sequence at some point:
- modin/core/storage_formats/base/query_compiler.py:insert_item creates the first_mask around line 3119, which calls a function called mask.
- mask calls getitem_column_array, which we find the implementation for in the pandas query_compiler.
- getitem_column_array calls take_2d_labels_or_positional, which can be found in the implementation of PandasDataframe.
- take_2d_labels_or_positional will call _take_2d_positional.
- _take_2d_positional creates a new modin frame based on a subset of the partitions.

What’s happening here is that the new_partitions list ends up being empty, and we set new_index to be None since we are looking at the index_cache (which is None from the copy). This causes us to have a fully empty Index, which causes the issue of returning an empty DataFrame.

Notes: I tried playing around with the new_partitions that are created and it seems that the empty list of new_partitions is intended. I guess this PR: https://github.com/modin-project/modin/pull/4911, prevented the recomputation of the index which would’ve avoided the situation in the first place. So perhaps the root cause is related to something here?
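One way to picture the empty-partitions case is the sketch below. It is not Modin's implementation: combine_partition_indexes is a hypothetical helper, assumed here only to show a function that joins per-partition indexes and still returns a pandas.Index when there are no partitions at all.

```python
import pandas as pd


def combine_partition_indexes(partition_indexes):
    """Hypothetical helper: join per-partition indexes into one pandas.Index."""
    if len(partition_indexes) == 0:
        # Return an empty Index rather than `[]` so callers can still
        # use Index methods such as `equals` or `join`.
        return pd.Index([])
    first, rest = partition_indexes[0], partition_indexes[1:]
    # Index.append accepts a list of Index objects.
    return first.append(rest) if rest else first


no_partitions = combine_partition_indexes([])
two_partitions = combine_partition_indexes([pd.Index([0, 1]), pd.Index([2])])

print(type(no_partitions).__name__, len(no_partitions))
print(list(two_partitions))
```

With this shape, a frame built from an empty new_partitions list would still hand downstream code an object it can compare and join, whatever the actual fix in Modin turns out to be.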