PERF: `mul` operator does not partition numpy arrays
Currently, the `mul` operator takes in any `other` object as an input and passes it directly into the Binary operator. However, this exposes a subtle performance bug when we pass in large numpy arrays, since they are not partitioned the way Modin DataFrame or Series objects are. As a result, we pay the cost of reduced parallelism and repeated serialization overhead. See the motivating example below:
Python 3.9.12 | packaged by conda-forge | (main, Mar 24 2022, 23:27:05)
Type 'copyright', 'credits' or 'license' for more information
IPython 8.4.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import modin.pandas as pd
In [2]: import decimal
In [3]: import numpy as np
In [4]: from modin.config import BenchmarkMode
In [5]: BenchmarkMode.put(True)
In [6]: M = pd.DataFrame(np.random.random_sample(size=(120, 60000))).applymap(lambda x:decimal.Decimal(str(x)))
UserWarning: Ray execution environment not yet initialized. Initializing...
To remove this warning, run the following python code before doing dataframe operations:
import ray
ray.init(runtime_env={'env_vars': {'__MODIN_AUTOIMPORT_PANDAS__': '1'}})
2022-08-22 14:16:54,281 INFO services.py:1456 -- View the Ray dashboard at http://127.0.0.1:8265
UserWarning: Distributing <class 'numpy.ndarray'> object. This may take some time.
In [7]: weights = np.random.random_sample(size=(60000,))
In [8]: weights = np.array(list(map(lambda x:decimal.Decimal(str(x)), weights)))
In [9]: %time M.multiply(weights)
CPU times: user 321 ms, sys: 53.7 ms, total: 375 ms
Wall time: 6.68 s
Out[9]:
0 1 ... 59998 59999
0 0.001871229680293184835132216529 0.1125595064461222127546447633 ... 0.1894143166633470435557906745 0.2349543683059415639697880109
1 0.3074624673719783109968605559 0.09273838904128218152586477092 ... 0.2029095456386146031138205756 0.1616853327195407084614832868
2 0.3897266694333883313273041400 0.2537830212539442242265080434 ... 0.8107787876704090106591548030 0.1229755346018885526882067748
3 0.2583124125587996502235699321 0.4498781243078582914813668500 ... 0.8564954919189455941558239659 0.09211689235883030752298349492
4 0.1061179072404943051562615619 0.3336578922164111842903271783 ... 0.1005454296723310086843963626 0.1874810569578940435377035713
.. ... ... ... ... ...
115 0.3949792275157328411261277589 0.6010145155847959067546547836 ... 0.5291756827557823994811168265 0.1325158507275452006359707929
116 0.07929250218996140697968032465 0.6395289718186264522090778522 ... 0.5738096098591226134330773518 0.1578424076909477723581652143
117 0.3138466101406778500488460671 0.6178395114502829387080024035 ... 0.2733019957925933811079534807 0.1701258997545075302035196599
118 0.1354218463660404257217389635 0.5474920319305188067014257354 ... 0.4398039382550085069756538372 0.01630106384776699496716116182
119 0.4149047206459432673825734424 0.1166747756703353400708336362 ... 0.08773573746534161304912106592 0.09971152436634388742225979974
[120 rows x 60000 columns]
In [10]: weights = pd.Series(weights)
UserWarning: Distributing <class 'numpy.ndarray'> object. This may take some time.
In [11]: %time M.multiply(weights)
CPU times: user 103 ms, sys: 42.3 ms, total: 146 ms
Wall time: 2.51 s
Out[11]:
0 1 ... 59998 59999
0 0.001871229680293184835132216529 0.1125595064461222127546447633 ... 0.1894143166633470435557906745 0.2349543683059415639697880109
1 0.3074624673719783109968605559 0.09273838904128218152586477092 ... 0.2029095456386146031138205756 0.1616853327195407084614832868
2 0.3897266694333883313273041400 0.2537830212539442242265080434 ... 0.8107787876704090106591548030 0.1229755346018885526882067748
3 0.2583124125587996502235699321 0.4498781243078582914813668500 ... 0.8564954919189455941558239659 0.09211689235883030752298349492
4 0.1061179072404943051562615619 0.3336578922164111842903271783 ... 0.1005454296723310086843963626 0.1874810569578940435377035713
.. ... ... ... ... ...
115 0.3949792275157328411261277589 0.6010145155847959067546547836 ... 0.5291756827557823994811168265 0.1325158507275452006359707929
116 0.07929250218996140697968032465 0.6395289718186264522090778522 ... 0.5738096098591226134330773518 0.1578424076909477723581652143
117 0.3138466101406778500488460671 0.6178395114502829387080024035 ... 0.2733019957925933811079534807 0.1701258997545075302035196599
118 0.1354218463660404257217389635 0.5474920319305188067014257354 ... 0.4398039382550085069756538372 0.01630106384776699496716116182
119 0.4149047206459432673825734424 0.1166747756703353400708336362 ... 0.08773573746534161304912106592 0.09971152436634388742225979974
[120 rows x 60000 columns]
As we can see here, we observe a 2.66x speedup when multiplying by a Modin Series object instead of a numpy array. One solution I can think of is to convert such objects to a Modin Series/DataFrame by default. The downside of this approach is that we would incur larger overheads for smaller objects, so we might need to do something smarter here.
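A minimal sketch of what that default conversion could look like, purely for illustration: the `_coerce_other` helper and the `_LARGE_NDARRAY` threshold are hypothetical names, not Modin internals, and the real fix would live in the binary operator's dispatch path.

```python
import numpy as np
import modin.pandas as pd

# Hypothetical cutoff; picking a real value would need benchmarking.
_LARGE_NDARRAY = 10_000

def _coerce_other(other):
    """Convert a large numpy operand into a distributed Modin object
    so it gets partitioned once instead of shipped whole to every task."""
    if isinstance(other, np.ndarray) and other.ndim in (1, 2) and other.size >= _LARGE_NDARRAY:
        # 1-D vectors become a Series, 2-D arrays a DataFrame.
        return pd.Series(other) if other.ndim == 1 else pd.DataFrame(other)
    return other

# Usage: M.multiply(_coerce_other(weights)) would then hit the partition-aware
# broadcast path shown in the timings above.
```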
@pyrito one more thought is that if the user uses a particular numpy array as a right multiplication operand multiple times, we’ll have to construct a modin series each time. I don’t think there’s a way to fix that problem internally, so maybe we should raise a warning when we convert the array to Modin.
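For reference, a hedged sketch of the pattern such a warning would nudge users toward: convert the array once up front and reuse the resulting Series across calls (plain user-level code, not Modin internals).

```python
import numpy as np
import modin.pandas as pd

M = pd.DataFrame(np.random.random_sample(size=(120, 60_000)))

# Convert the right operand once; the distribution warning (and cost) is paid here.
weights = pd.Series(np.random.random_sample(size=(60_000,)))

r1 = M.multiply(weights)  # reuses the already-partitioned Series
r2 = M.multiply(weights)  # no second conversion/serialization of the vector
```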
To clarify the performance difference between a np array and a Series as the right operand (a rough back-of-envelope sketch follows below):
- For a np array: we call `apply_full_axis` and serialize the whole vector once for every row partition of the left operand.
- For a Modin Series: we broadcast the right operand's column partitions to every left block that they need to go to (via `binary_op`): https://github.com/modin-project/modin/blob/bd326f1c4175102489f08d271a53cf374bd9125e/modin/core/dataframe/algebra/binary.py#L97-L103
@pyrito please correct me if I’m wrong.
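A back-of-envelope sketch of the serialization volume implied by the two paths (the partition counts and sizes are made up for illustration, not measured from Modin):

```python
n_row_parts = 4              # row partitions of the left operand M
n_col_parts = 16             # column partitions of M / of the right operand
vector_nbytes = 60_000 * 8   # the weights vector as float64

# np array path: the whole vector is serialized once per row partition.
numpy_path_bytes = n_row_parts * vector_nbytes

# Modin Series path: each column partition of the Series is serialized once;
# blocks that need it receive a reference rather than a fresh copy.
series_path_bytes = n_col_parts * (vector_nbytes // n_col_parts)

print(numpy_path_bytes, series_path_bytes)  # 1_920_000 vs. 480_000 bytes
```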
From your options, I prefer (1). In this case, even if we partitioned the right operand into smaller objects on the main node as in (2), we'd have to serialize each partition of the right operand once for every row partition of the left operand. Serializing everything once by making the whole vector a Modin DataFrame means we can pass around references to the partitions and deserialize each right partition multiple times if needed. (I am assuming here that there's no copartitioning needed -- if we do need copartitioning, we have to pay for it.)
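To make the reference-passing point concrete, here is a standalone Ray sketch (not Modin code): putting the right operand's partitions into the object store once means each task receives a reference and deserializes it locally, instead of re-serializing the array for every row partition of the left operand.

```python
import numpy as np
import ray

ray.init(ignore_reinit_error=True)

@ray.remote
def multiply_block(left_block, right_part):
    # right_part arrives as an object-store reference; Ray resolves it here.
    return left_block * right_part

weights = np.random.random_sample(size=(60_000,))
# Serialize each partition of the right operand exactly once.
weight_refs = [ray.put(part) for part in np.array_split(weights, 16)]

left_blocks = [np.random.random_sample(size=(30, 3_750)) for _ in range(16)]
results = ray.get([
    multiply_block.remote(block, ref) for block, ref in zip(left_blocks, weight_refs)
])
```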
I would guess we can get away with not gating on the size of the object for now. For smaller objects, creating a Modin Series should not be too expensive either. Also, we are already serializing the entire right operand anyway when we pass it as an argument to Ray tasks.