PERF: `mul` operator does not partition numpy arrays
Currently, the `mul` operator takes in any `other` object as an input and passes it directly into the Binary operator. However, this exposes a subtle performance bug when we pass in large numpy arrays, since they are not partitioned the way Modin DataFrame or Series objects are. As a result, we pay the cost of reduced parallelism and repeated serialization overhead. See the motivating example below:
Python 3.9.12 | packaged by conda-forge | (main, Mar 24 2022, 23:27:05)
Type 'copyright', 'credits' or 'license' for more information
IPython 8.4.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import modin.pandas as pd
In [2]: import decimal
In [3]: import numpy as np
In [4]: from modin.config import BenchmarkMode
In [5]: BenchmarkMode.put(True)
In [6]: M = pd.DataFrame(np.random.random_sample(size=(120, 60000))).applymap(lambda x:decimal.Decimal(str(x)))
UserWarning: Ray execution environment not yet initialized. Initializing...
To remove this warning, run the following python code before doing dataframe operations:
import ray
ray.init(runtime_env={'env_vars': {'__MODIN_AUTOIMPORT_PANDAS__': '1'}})
2022-08-22 14:16:54,281 INFO services.py:1456 -- View the Ray dashboard at http://127.0.0.1:8265
UserWarning: Distributing <class 'numpy.ndarray'> object. This may take some time.
In [7]: weights = np.random.random_sample(size=(60000,))
In [8]: weights = np.array(list(map(lambda x:decimal.Decimal(str(x)), weights)))
In [9]: %time M.multiply(weights)
CPU times: user 321 ms, sys: 53.7 ms, total: 375 ms
Wall time: 6.68 s
Out[9]:
0 1 ... 59998 59999
0 0.001871229680293184835132216529 0.1125595064461222127546447633 ... 0.1894143166633470435557906745 0.2349543683059415639697880109
1 0.3074624673719783109968605559 0.09273838904128218152586477092 ... 0.2029095456386146031138205756 0.1616853327195407084614832868
2 0.3897266694333883313273041400 0.2537830212539442242265080434 ... 0.8107787876704090106591548030 0.1229755346018885526882067748
3 0.2583124125587996502235699321 0.4498781243078582914813668500 ... 0.8564954919189455941558239659 0.09211689235883030752298349492
4 0.1061179072404943051562615619 0.3336578922164111842903271783 ... 0.1005454296723310086843963626 0.1874810569578940435377035713
.. ... ... ... ... ...
115 0.3949792275157328411261277589 0.6010145155847959067546547836 ... 0.5291756827557823994811168265 0.1325158507275452006359707929
116 0.07929250218996140697968032465 0.6395289718186264522090778522 ... 0.5738096098591226134330773518 0.1578424076909477723581652143
117 0.3138466101406778500488460671 0.6178395114502829387080024035 ... 0.2733019957925933811079534807 0.1701258997545075302035196599
118 0.1354218463660404257217389635 0.5474920319305188067014257354 ... 0.4398039382550085069756538372 0.01630106384776699496716116182
119 0.4149047206459432673825734424 0.1166747756703353400708336362 ... 0.08773573746534161304912106592 0.09971152436634388742225979974
[120 rows x 60000 columns]
In [10]: weights = pd.Series(weights)
UserWarning: Distributing <class 'numpy.ndarray'> object. This may take some time.
In [11]: %time M.multiply(weights)
CPU times: user 103 ms, sys: 42.3 ms, total: 146 ms
Wall time: 2.51 s
Out[11]:
0 1 ... 59998 59999
0 0.001871229680293184835132216529 0.1125595064461222127546447633 ... 0.1894143166633470435557906745 0.2349543683059415639697880109
1 0.3074624673719783109968605559 0.09273838904128218152586477092 ... 0.2029095456386146031138205756 0.1616853327195407084614832868
2 0.3897266694333883313273041400 0.2537830212539442242265080434 ... 0.8107787876704090106591548030 0.1229755346018885526882067748
3 0.2583124125587996502235699321 0.4498781243078582914813668500 ... 0.8564954919189455941558239659 0.09211689235883030752298349492
4 0.1061179072404943051562615619 0.3336578922164111842903271783 ... 0.1005454296723310086843963626 0.1874810569578940435377035713
.. ... ... ... ... ...
115 0.3949792275157328411261277589 0.6010145155847959067546547836 ... 0.5291756827557823994811168265 0.1325158507275452006359707929
116 0.07929250218996140697968032465 0.6395289718186264522090778522 ... 0.5738096098591226134330773518 0.1578424076909477723581652143
117 0.3138466101406778500488460671 0.6178395114502829387080024035 ... 0.2733019957925933811079534807 0.1701258997545075302035196599
118 0.1354218463660404257217389635 0.5474920319305188067014257354 ... 0.4398039382550085069756538372 0.01630106384776699496716116182
119 0.4149047206459432673825734424 0.1166747756703353400708336362 ... 0.08773573746534161304912106592 0.09971152436634388742225979974
[120 rows x 60000 columns]
As we can see here, we observe a 2.66x speedup when multiplying by a Modin Series object instead of a numpy array. One solution I can think of is to convert such objects to a Modin Series/DataFrame by default. The downside of this approach is that we would incur larger overheads for smaller objects, so we might need to do something smarter here.
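A minimal sketch of what that default conversion could look like, purely for illustration: the `_coerce_other` helper and the `_LARGE_NDARRAY` threshold are hypothetical names, not Modin internals, and the real fix would live in the binary operator's dispatch path.

```python
import numpy as np
import modin.pandas as pd

# Hypothetical cutoff; picking a real value would need benchmarking.
_LARGE_NDARRAY = 10_000

def _coerce_other(other):
    """Convert a large numpy operand into a distributed Modin object
    so it gets partitioned once instead of shipped whole to every task."""
    if isinstance(other, np.ndarray) and other.ndim in (1, 2) and other.size >= _LARGE_NDARRAY:
        # 1-D vectors become a Series, 2-D arrays a DataFrame.
        return pd.Series(other) if other.ndim == 1 else pd.DataFrame(other)
    return other

# Usage: M.multiply(_coerce_other(weights)) would then hit the partition-aware
# broadcast path shown in the timings above.
```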
@pyrito one more thought is that if the user uses a particular numpy array as a right multiplication operand multiple times, we’ll have to construct a modin series each time. I don’t think there’s a way to fix that problem internally, so maybe we should raise a warning when we convert the array to Modin.
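For reference, a hedged sketch of the pattern such a warning would nudge users toward: convert the array once up front and reuse the resulting Series across calls (plain user-level code, not Modin internals).

```python
import numpy as np
import modin.pandas as pd

M = pd.DataFrame(np.random.random_sample(size=(120, 60_000)))

# Convert the right operand once; the distribution warning (and cost) is paid here.
weights = pd.Series(np.random.random_sample(size=(60_000,)))

r1 = M.multiply(weights)  # reuses the already-partitioned Series
r2 = M.multiply(weights)  # no second conversion/serialization of the vector
```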
To clarify the performance difference between a np array and a Series as the right operand (a rough back-of-envelope sketch follows below):
- For a np array: we call `apply_full_axis` and serialize the whole vector once for every row partition of the left operand.
- For a Modin Series: we broadcast the right operand's column partitions to every left block that they need to go to (via `binary_op`): https://github.com/modin-project/modin/blob/bd326f1c4175102489f08d271a53cf374bd9125e/modin/core/dataframe/algebra/binary.py#L97-L103
@pyrito please correct me if I’m wrong.
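A back-of-envelope sketch of the serialization volume implied by the two paths (the partition counts and sizes are made up for illustration, not measured from Modin):

```python
n_row_parts = 4              # row partitions of the left operand M
n_col_parts = 16             # column partitions of M / of the right operand
vector_nbytes = 60_000 * 8   # the weights vector as float64

# np array path: the whole vector is serialized once per row partition.
numpy_path_bytes = n_row_parts * vector_nbytes

# Modin Series path: each column partition of the Series is serialized once;
# blocks that need it receive a reference rather than a fresh copy.
series_path_bytes = n_col_parts * (vector_nbytes // n_col_parts)

print(numpy_path_bytes, series_path_bytes)  # 1_920_000 vs. 480_000 bytes
```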
From your options, I prefer (1). In this case, even if we partitioned the right operand into smaller objects on the main node as in (2), we'd have to serialize each partition of the right operand once for every row partition of the left operand. Serializing everything once by making the whole vector a Modin DataFrame means we can pass around references to the partitions and deserialize each right partition multiple times if needed. (I am assuming here that there's no copartitioning needed -- if we do need copartitioning, we have to pay for it.)
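To make the reference-passing point concrete, here is a standalone Ray sketch (not Modin code): putting the right operand's partitions into the object store once means each task receives a reference and deserializes it locally, instead of re-serializing the array for every row partition of the left operand.

```python
import numpy as np
import ray

ray.init(ignore_reinit_error=True)

@ray.remote
def multiply_block(left_block, right_part):
    # right_part arrives as an object-store reference; Ray resolves it here.
    return left_block * right_part

weights = np.random.random_sample(size=(60_000,))
# Serialize each partition of the right operand exactly once.
weight_refs = [ray.put(part) for part in np.array_split(weights, 16)]

left_blocks = [np.random.random_sample(size=(30, 3_750)) for _ in range(16)]
results = ray.get([
    multiply_block.remote(block, ref) for block, ref in zip(left_blocks, weight_refs)
])
```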
I would guess we can get away with not gating on the size of the object for now. For smaller objects, creating a Modin Series should not be too expensive either. Also, we are already serializing the entire right operand anyway when we pass it as an argument to Ray tasks.