Series.drop() is 10x slower on Modin than on Pandas
System information
- OS Platform and Distribution: MacBook Pro (16-inch, 2019) with MacOS BigSur 11.5.2
- Memory: 16 GB 2667 MHz DDR4
- Modin version (modin.__version__): 0.12.0+40.g7e85c5df
- Python version: 3.8.8
- Code we can use to reproduce:
import modin.pandas as pd
matches = pd.Series([0] * 16000000)
%time repr(matches.drop([0]))
%time repr(matches.drop(matches.index))
Describe the problem
I am using Ray with 16 partitions, and Dask with 12 workers.
With the snippet above, dropping the first entry in the series takes 9.64 seconds on modin[ray], 9.93 seconds on modin[dask], and 0.433 seconds on pandas. Dropping the entire series takes 24.4 seconds on modin[ray], 24.9 seconds on modin[dask], and 2.12 seconds on pandas.
This bug came up when I was investigating this StackOverflow question.
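The reproduction above relies on the IPython %time magic. A minimal sketch of the same measurement as a plain script follows, assuming plain pandas as the baseline; the helper names time_call and time_drops are hypothetical, and the series size is reduced from the 16,000,000 used in the report so that the sketch runs quickly.

```python
import time

import pandas as pd  # assumption: swap for `import modin.pandas as pd` to time Modin


def time_call(fn):
    """Return (result, elapsed_seconds) for a single call to fn()."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def time_drops(n):
    """Time the two drop calls from the report on a Series of n zeros."""
    series = pd.Series([0] * n)
    # repr() forces the result to be materialized, as in the original snippet.
    _, drop_first = time_call(lambda: repr(series.drop([0])))
    _, drop_all = time_call(lambda: repr(series.drop(series.index)))
    return {"drop_first": drop_first, "drop_all": drop_all}


# The report used n = 16_000_000; a smaller n keeps this sketch quick.
timings = time_drops(1_000_000)
for name, seconds in timings.items():
    print(f"{name}: {seconds:.3f} s")
```

Timings from a single call are noisy, so repeating each measurement a few times and taking the minimum gives a steadier comparison between backends.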
Source code / logs
Issue Analytics
- State:
- Created 2 years ago
- Comments:5 (5 by maintainers)
There’s probably something else to improve in mask. cc @dchigarev
I checked Series.drop with different data sizes to find the main bottlenecks. All results are for the Ray engine with 112 workers.
The results for the first case are as follows:
modin hotspot 1 is: https://github.com/modin-project/modin/blob/ee2440c53a1e3bd47736776e7c643f05c4a0db70/modin/pandas/base.py#L1192
modin hotspot 2 is: https://github.com/modin-project/modin/blob/ee2440c53a1e3bd47736776e7c643f05c4a0db70/modin/core/storage_formats/pandas/query_compiler.py#L2280-L2291
The results for the second case are as follows:
modin hotspot 1 is: https://github.com/modin-project/modin/blob/ee2440c53a1e3bd47736776e7c643f05c4a0db70/modin/core/storage_formats/pandas/query_compiler.py#L2280-L2291
modin hotspot 2 is: https://github.com/modin-project/modin/blob/c17dde71ae8114723a13279e36bf5532dcba328a/modin/core/dataframe/pandas/dataframe/dataframe.py#L628-L630
The same hotspot in self._get_dict_of_block_index is observed in https://github.com/modin-project/modin/issues/4268.
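Hotspots of this kind can be located by profiling the call in the driver process. The following is a minimal sketch using the standard-library cProfile and pstats modules, assuming plain pandas as a stand-in; it is not the code used for the measurements above, and the helper name top_hotspots is hypothetical. With a distributed engine such as Ray, a profile taken this way only covers the driver process, not work executed in remote workers.

```python
import cProfile
import io
import pstats

import pandas as pd  # assumption: swap for `import modin.pandas as pd` to profile Modin


def top_hotspots(fn, limit=5):
    """Profile one call to fn() and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.runcall(fn)

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sorting by cumulative time surfaces the functions that dominate the call.
    stats.sort_stats("cumulative").print_stats(limit)
    return buffer.getvalue()


series = pd.Series([0] * 1_000_000)
report = top_hotspots(lambda: series.drop([0]))
print(report)
```

Each row of the report lists a function with its file and line number, which is how a slow call can be traced back to specific source lines such as those linked above.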