Relabeling Modin Frame loses partitions shape cache


System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Any
  • Modin version (modin.__version__): a3ddf2f
  • Python version: 3.7.5
  • Code we can use to reproduce:
import modin.pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

modin_frame = df._query_compiler._modin_frame
src_cache = modin_frame._partitions[0][0]._length_cache

modin_frame.columns = ["c", "d"]
relabeled_cache = modin_frame._partitions[0][0]._length_cache

assert (
    src_cache == relabeled_cache
), f"src: {src_cache} | relabeled: {relabeled_cache}"

Output:

Traceback (most recent call last):
  File "tst3.py", line 13, in <module>
    ), f"src: {src_cache} | relabeled: {relabeled_cache}"
AssertionError: src: 3 | relabeled: None

Describe the problem

Setting new axis labels should not change the partitioning, so the shapes of the partitions, and therefore their shape caches, should remain untouched.

The cache is lost because propagating the new labels into the partitions (performed by PandasFrame.synchronize_labels) is done like any regular function application: https://github.com/modin-project/modin/blob/a3ddf2f01163a312416d2a8bc456ba9582ae9b4d/modin/engines/base/frame/data.py#L360-L368 This gives the partition no hint that its shape will be unchanged, so it resets the shape cache.

It seems there should be some mechanism (for example, a different apply function, or a parameter for the existing one) that tells the partition it can preserve its shape cache.
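To make the proposal concrete, here is a minimal sketch of the "parameter for the existing apply function" variant. It uses a toy, pure-Python partition rather than Modin's real classes: ToyPartition, the preserve_shape flag, and the relabel helper are all hypothetical names chosen for illustration, and nothing here should be read as Modin's actual API or as the fix that was eventually merged.

```python
class ToyPartition:
    """Hypothetical stand-in for a partition; NOT Modin's real class."""

    def __init__(self, data, length_cache=None, width_cache=None):
        self._data = data  # dict: column label -> list of values
        self._length_cache = length_cache
        self._width_cache = width_cache

    def length(self):
        # Compute the row count lazily and remember it.
        if self._length_cache is None:
            self._length_cache = len(next(iter(self._data.values()), []))
        return self._length_cache

    def width(self):
        # Compute the column count lazily and remember it.
        if self._width_cache is None:
            self._width_cache = len(self._data)
        return self._width_cache

    def apply(self, func, preserve_shape=False):
        """Apply func to the data and return a new partition.

        preserve_shape is the hypothetical hint: the caller promises that
        func keeps the shape, so the cached length and width carry over.
        Without the hint, the new partition starts with empty caches.
        """
        new_data = func(self._data)
        if preserve_shape:
            return ToyPartition(new_data, self._length_cache, self._width_cache)
        return ToyPartition(new_data)


def relabel(new_labels):
    """Build a function that renames columns without touching the values."""
    def _rename(data):
        return dict(zip(new_labels, data.values()))
    return _rename


part = ToyPartition({"a": [1, 2, 3], "b": [4, 5, 6]})
part.length()  # populate the caches, as a previous operation would have
part.width()

plain = part.apply(relabel(["c", "d"]))                      # caches reset
kept = part.apply(relabel(["c", "d"]), preserve_shape=True)  # caches kept

print(plain._length_cache, kept._length_cache)
```

The flag keeps a single apply entry point and puts the responsibility on the caller, who knows that relabeling cannot change the shape. A separate shape-preserving apply method would express the same idea with a different interface.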

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 18 (18 by maintainers)

Top GitHub Comments

devin-petersohn commented, Jun 15, 2021

@YarShev that is what should happen in this commit. In master, all 19 columns will be on one worker because minimum size is 32.

mvashishtha commented, Sep 1, 2022

I think #3662 fixed this.


