Series.drop() is 10x slower on Modin than on Pandas

System information

OS Platform and Distribution: MacBook Pro (16-inch, 2019) with macOS Big Sur 11.5.2
Memory: 16 GB 2667 MHz DDR4
Modin version (modin.__version__): 0.12.0+40.g7e85c5df
Python version: 3.8.8
Code we can use to reproduce:

import modin.pandas as pd
matches = pd.Series([0] * 16000000)
# %time is an IPython/Jupyter magic, so run this in a notebook or IPython shell
%time repr(matches.drop([0]))
%time repr(matches.drop(matches.index))

Describe the problem

I am using Ray with 16 partitions, and Dask with 12 workers.

With the snippet above, dropping the first entry in the series takes 9.64 seconds on modin[ray], 9.93 seconds on modin[dask], and 0.433 seconds on pandas. Dropping every entry in the series takes 24.4 seconds on modin[ray], 24.9 seconds on modin[dask], and 2.12 seconds on pandas.
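
For reference, the slowdown factors implied by these timings can be worked out directly. This is a minimal sketch that only does arithmetic on the numbers reported above; the `timings` layout and the `slowdown` helper are illustrative names, not part of Modin or pandas.

```python
# Timings in seconds, copied from the report above.
timings = {
    "drop([0])": {"modin[ray]": 9.64, "modin[dask]": 9.93, "pandas": 0.433},
    "drop(index)": {"modin[ray]": 24.4, "modin[dask]": 24.9, "pandas": 2.12},
}


def slowdown(case, engine):
    """Return how many times slower `engine` is than pandas for `case`."""
    return timings[case][engine] / timings[case]["pandas"]


for case in timings:
    for engine in ("modin[ray]", "modin[dask]"):
        print(f"{case:12s} {engine:12s} {slowdown(case, engine):.1f}x slower than pandas")
```

On these numbers the single-label drop is roughly 22x slower than pandas, and dropping the whole index is between 11x and 12x slower.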

This bug came up when I was investigating this StackOverflow question.

Issue Analytics

Created 2 years ago
Comments: 5 (5 by maintainers)

Top GitHub Comments

YarShev commented, Feb 11, 2022

There’s probably something else to improve in mask. cc @dchigarev

prutskov commented, Mar 11, 2022

I checked Series.drop with different data sizes to find the main bottlenecks. All results are for the Ray engine with 112 workers.

The first benchmark drops every label in the index:

Code

from time import time

import modin.pandas as mpd
from modin.config import BenchmarkMode

# Run Modin operations synchronously so the timings include the actual work
BenchmarkMode.put(True)

mdf = mpd.Series([0] * 100_000_000)
pdf = mdf._to_pandas()

t = time()
pdf.drop(pdf.index)
print(f"t_all_pd: {time() - t} s")

t = time()
mdf.drop(mdf.index)
print(f"t_all_md: {time() - t} s")

The results are as follows:

Rows	10M, time (s)	100M, time (s)
pandas	0.0471	0.4774
modin	14.9805	148.2909
modin hotspot 1	14.77	140.97
modin hotspot 2	0.2	7.2

modin hotspot 1 is: https://github.com/modin-project/modin/blob/ee2440c53a1e3bd47736776e7c643f05c4a0db70/modin/pandas/base.py#L1192

modin hotspot 2 is: https://github.com/modin-project/modin/blob/ee2440c53a1e3bd47736776e7c643f05c4a0db70/modin/core/storage_formats/pandas/query_compiler.py#L2280-L2291
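
To put the two hotspots in proportion, the share of total Modin time that each one accounts for can be computed from the table above. This is a rough sketch using only the reported numbers; the `results` layout and the `share` helper are made-up names for illustration.

```python
# Seconds, copied from the results table (10M and 100M rows).
results = {
    "10M": {"pandas": 0.0471, "modin": 14.9805, "hotspot 1": 14.77, "hotspot 2": 0.2},
    "100M": {"pandas": 0.4774, "modin": 148.2909, "hotspot 1": 140.97, "hotspot 2": 7.2},
}


def share(size, hotspot):
    """Fraction of the total Modin time spent in `hotspot` for `size` rows."""
    return results[size][hotspot] / results[size]["modin"]


for size, row in results.items():
    print(f"{size}: modin is {row['modin'] / row['pandas']:.0f}x slower than pandas")
    for hotspot in ("hotspot 1", "hotspot 2"):
        print(f"  {hotspot}: {share(size, hotspot):.1%} of modin time")
```

By these figures hotspot 1 accounts for well over 90% of the Modin time at both sizes, so it is the place to look first.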

The second benchmark drops only the label [0]:

Code

from time import time

import modin.pandas as mpd
from modin.config import BenchmarkMode

# Run Modin operations synchronously so the timings include the actual work
BenchmarkMode.put(True)

mdf = mpd.Series([0] * 100_000_000)
pdf = mdf._to_pandas()

t = time()
pdf.drop([0])
print(f"t_single_pd: {time() - t} s")

t = time()
mdf.drop([0])
print(f"t_single_md: {time() - t} s")

The results are as follows: