Relabeling Modin Frame loses partitions shape cache
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Any
- Modin version (modin.__version__): a3ddf2f
- Python version: 3.7.5
- Code we can use to reproduce:
import modin.pandas as pd
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
modin_frame = df._query_compiler._modin_frame
src_cache = modin_frame._partitions[0][0]._length_cache
modin_frame.columns = ["c", "d"]
relabeled_cache = modin_frame._partitions[0][0]._length_cache
assert (
    src_cache == relabeled_cache
), f"src: {src_cache} | relabeled: {relabeled_cache}"
Output:
Traceback (most recent call last):
  File "tst3.py", line 13, in <module>
    ), f"src: {src_cache} | relabeled: {relabeled_cache}"
AssertionError: src: 3 | relabeled: None
Describe the problem
Setting new axis labels should not change the partitioning, so the shapes of the partitions, and therefore their shape caches, should remain untouched.
The cache is lost because propagating the new labels into the partitions (performed by PandasFrame.synchronize_labels) is done just like a regular function application:
https://github.com/modin-project/modin/blob/a3ddf2f01163a312416d2a8bc456ba9582ae9b4d/modin/engines/base/frame/data.py#L360-L368
This gives the partition no hint that its shape will be unchanged, so it resets the shape cache.
There should probably be some mechanism (for example, a different apply function, or a parameter for the existing one) that tells a partition it can preserve its shape cache.
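To make the proposal concrete, here is a minimal sketch of the "parameter for the existing apply function" idea. It does not use Modin's real classes: ToyPartition, its preserve_shape flag, and the relabel helper are hypothetical names invented for illustration, and the partition data is a plain dict rather than a pandas object.

```python
class ToyPartition:
    """Hypothetical stand-in for a partition; NOT Modin's actual class."""

    def __init__(self, data, length_cache=None):
        self.data = data                    # dict: column label -> list of values
        self._length_cache = length_cache   # cached row count, or None if unknown

    def length(self):
        # Compute the row count lazily and remember it.
        if self._length_cache is None:
            first_column = next(iter(self.data.values()), [])
            self._length_cache = len(first_column)
        return self._length_cache

    def apply(self, func, preserve_shape=False):
        # `preserve_shape` is the assumed hint: the caller promises that
        # `func` does not change the number of rows, so the cache is carried over.
        new_data = func(self.data)
        cache = self._length_cache if preserve_shape else None
        return ToyPartition(new_data, cache)


def relabel(new_labels):
    """Build a function that renames columns without touching the values."""
    def _rename(data):
        return dict(zip(new_labels, data.values()))
    return _rename


part = ToyPartition({"a": [1, 2, 3], "b": [4, 5, 6]})
part.length()  # populate the cache

plain = part.apply(relabel(["c", "d"]))                      # cache is reset
kept = part.apply(relabel(["c", "d"]), preserve_shape=True)  # cache survives

print(plain._length_cache, kept._length_cache)
```

With a regular apply the new partition starts with an empty cache, as in the traceback above; with the hint, the relabeled partition keeps the cached length and does not need to recompute it.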
@YarShev that is what should happen in this commit. In master, all 19 columns will be on one worker because minimum size is 32.
I think #3662 fixed this.