Series.drop() is 10x slower on Modin than on Pandas

System information

  • OS Platform and Distribution: MacBook Pro (16-inch, 2019) with macOS Big Sur 11.5.2
  • Memory: 16 GB 2667 MHz DDR4
  • Modin version (modin.__version__): 0.12.0+40.g7e85c5df
  • Python version: 3.8.8
  • Code we can use to reproduce:
import modin.pandas as pd
matches = pd.Series([0] * 16000000)
%time repr(matches.drop([0]))
%time repr(matches.drop(matches.index))
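
Note that `%time` is an IPython magic, so the snippet above only runs in IPython or Jupyter. A minimal sketch of an equivalent for a plain Python script follows, assuming wall-clock time from `time.perf_counter` is an acceptable substitute. The `timed` helper is a hypothetical name introduced here, not part of Modin or pandas, and the Modin calls are left as comments so the sketch runs without Modin installed.

```python
from time import perf_counter


def timed(label, fn):
    """Call fn(), print the elapsed wall-clock time, and return (result, seconds)."""
    start = perf_counter()
    result = fn()
    elapsed = perf_counter() - start
    print(f"{label}: {elapsed:.3f} s")
    return result, elapsed


# Stand-in workload so the sketch runs without Modin installed.
result, elapsed = timed("sum", lambda: sum(range(1_000_000)))

# With Modin installed, the calls from the report would be wrapped like this:
#   import modin.pandas as pd
#   matches = pd.Series([0] * 16_000_000)
#   timed("drop first", lambda: repr(matches.drop([0])))
#   timed("drop all", lambda: repr(matches.drop(matches.index)))
```

Passing a zero-argument callable keeps the timed region limited to the call itself, which mirrors what `%time` measures for a single statement.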

Describe the problem

I am using Ray with 16 partitions, and Dask with 12 workers.

With the snippet above, dropping the first entry in the series takes 9.64 seconds on modin[ray], 9.93 seconds on modin[dask], and 0.433 seconds on pandas. Dropping the entire series takes 24.4 seconds on modin[ray], 24.9 seconds on modin[dask], and 2.12 seconds on pandas.
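
To put the "10x" in the title in context, the slowdown factors can be worked out directly from the timings quoted above. The sketch below only divides the reported numbers; the dictionary layout and the `slowdowns` helper are illustrative names, not anything from Modin.

```python
# Timings in seconds, copied from the report above.
timings = {
    "drop([0])": {"pandas": 0.433, "modin[ray]": 9.64, "modin[dask]": 9.93},
    "drop(index)": {"pandas": 2.12, "modin[ray]": 24.4, "modin[dask]": 24.9},
}


def slowdowns(row):
    """Return each Modin timing divided by the pandas timing for the same call."""
    base = row["pandas"]
    return {name: secs / base for name, secs in row.items() if name != "pandas"}


for call, row in timings.items():
    for engine, factor in slowdowns(row).items():
        print(f"{call:12s} {engine:12s} {factor:5.1f}x slower than pandas")
```

By these numbers, dropping a single label is roughly 22x slower on Modin, and dropping the whole index is roughly 11.5x slower.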

This bug came up when I was investigating this StackOverflow question.

Source code / logs

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

1 reaction
YarShev commented, Feb 11, 2022

There’s probably something else to improve in mask. cc @dchigarev

0 reactions
prutskov commented, Mar 11, 2022

I benchmarked Series.drop with different data sizes to find the main bottlenecks. All results are for the Ray engine with 112 workers.

  1. The first benchmark drops every label in the index:
Code
from time import time

import modin.pandas as mpd
from modin.config import BenchmarkMode

BenchmarkMode.put(True)

mdf = mpd.Series([0] * 100_000_000)
pdf = mdf._to_pandas()

t = time()
pdf.drop(pdf.index)
print(f"t_all_pd: {time() - t} s")

t = time()
mdf.drop(mdf.index)
print(f"t_all_md: {time() - t} s")

The results are as follows:

| | 10m rows, time (s) | 100m rows, time (s) |
|---|---|---|
| pandas | 0.0471 | 0.4774 |
| modin | 14.9805 | 148.2909 |
| modin hotspot 1 | 14.77 | 140.97 |
| modin hotspot 2 | 0.2 | 7.2 |

modin hotspot 1 is: https://github.com/modin-project/modin/blob/ee2440c53a1e3bd47736776e7c643f05c4a0db70/modin/pandas/base.py#L1192

modin hotspot 2 is: https://github.com/modin-project/modin/blob/ee2440c53a1e3bd47736776e7c643f05c4a0db70/modin/core/storage_formats/pandas/query_compiler.py#L2280-L2291
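
As a rough reading of the first benchmark, the sketch below computes Modin's slowdown relative to pandas and the share of Modin's total time spent in each hotspot. It uses only the figures from the results above; `bench` and `summarize` are illustrative names.

```python
# Seconds, copied from the drop-by-full-index results above.
bench = {
    "10m": {"pandas": 0.0471, "modin": 14.9805, "hotspot 1": 14.77, "hotspot 2": 0.2},
    "100m": {"pandas": 0.4774, "modin": 148.2909, "hotspot 1": 140.97, "hotspot 2": 7.2},
}


def summarize(row):
    """Return Modin's slowdown versus pandas and each hotspot's share of Modin's total."""
    total = row["modin"]
    return {
        "slowdown": total / row["pandas"],
        "hotspot 1 share": row["hotspot 1"] / total,
        "hotspot 2 share": row["hotspot 2"] / total,
    }


for size, row in bench.items():
    s = summarize(row)
    print(
        f"{size:>4}: {s['slowdown']:.0f}x slower, "
        f"hotspot 1 = {s['hotspot 1 share']:.1%}, "
        f"hotspot 2 = {s['hotspot 2 share']:.1%}"
    )
```

On these numbers hotspot 1 accounts for about 95% to 99% of Modin's time, so it appears to be the dominant cost when dropping by the full index.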

  2. The second benchmark drops only the label [0]:
Code
from time import time

import modin.pandas as mpd
from modin.config import BenchmarkMode

BenchmarkMode.put(True)

mdf = mpd.Series([0] * 100_000_000)
pdf = mdf._to_pandas()

t = time()
pdf.drop([0])
print(f"t_single_pd: {time() - t} s")

t = time()
mdf.drop([0])
print(f"t_single_md: {time() - t} s")

The results are as follows:

| | 10m rows, time (s) | 100m rows, time (s) |
|---|---|---|
| pandas | 0.3512 | 3.7244 |
| modin | 3.3007 | 42.9572 |
| modin hotspot 1 | 1.34 | 20.29 |
| modin hotspot 2 | 1.69 | 20.33 |

modin hotspot 1 is: https://github.com/modin-project/modin/blob/ee2440c53a1e3bd47736776e7c643f05c4a0db70/modin/core/storage_formats/pandas/query_compiler.py#L2280-L2291

modin hotspot 2 is: https://github.com/modin-project/modin/blob/c17dde71ae8114723a13279e36bf5532dcba328a/modin/core/dataframe/pandas/dataframe/dataframe.py#L628-L630

The same hotspot in self._get_dict_of_block_index is observed in https://github.com/modin-project/modin/issues/4268.
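
For the single-label benchmark, it may also help to look at how each timing grows when the row count goes from 10m to 100m. The sketch below divides the 100m-row time by the 10m-row time for each entry in the second results table; `single` and `growth` are illustrative names.

```python
# Seconds for 10m and 100m rows, copied from the drop([0]) results above.
single = {
    "pandas": (0.3512, 3.7244),
    "modin": (3.3007, 42.9572),
    "modin hotspot 1": (1.34, 20.29),
    "modin hotspot 2": (1.69, 20.33),
}


def growth(small, large):
    """Return how many times longer the 100m-row run took than the 10m-row run."""
    return large / small


for name, (t_10m, t_100m) in single.items():
    print(f"{name:16s} {growth(t_10m, t_100m):5.1f}x time for 10x rows")
```

For a 10x increase in rows, pandas takes about 10.6x longer, while Modin and its two hotspots take about 12x to 15x longer. With only two data points per entry this is a hint rather than a measurement of scaling behavior.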
