
How is `groupby().apply()` parallelized?

See original GitHub issue

(I see you prefer “general questions” to go to a mailing list, but I don’t know how that works, so I hope it’s OK to ask my question here. I apologize for any inconvenience.)

It’s a very interesting project you have made! I am currently testing whether I can use it as the backend of a new financial data API I am working on, because I have some quite slow functions that group data by stock ticker and perform calculations for each ticker individually.

I am particularly interested in parallelizing computations of the form df.groupby('Ticker').apply(func), but currently Modin seems to take the same time as normal Pandas for this kind of computation.

Here’s an example of how I would parallelize it using Python multiprocessing.Pool:

import multiprocessing

import pandas as pd


# Apply this function to each stock individually.
def func(df_grp):
    # Do something with the group ...
    df_grp_result = df_grp  # placeholder for the real per-ticker calculation
    return df_grp_result


# Split the original DataFrame (df, which has a 'Ticker' column) into sub-groups.
groups_iter = df.groupby('Ticker')
tickers, df_groups = zip(*groups_iter)

# Process each DataFrame sub-group in parallel.
pool = multiprocessing.Pool()
result = pool.map(func, df_groups)
pool.close()
pool.join()

# Combine the results back into a single DataFrame.
df_result = pd.concat(result, axis=0)

But I can’t actually implement it like this, because multiprocessing.Pool has several limitations that don’t fit well with how the rest of my code-base is structured (e.g. func must be picklable, so it cannot be a local function or a lambda). I have also tried the pathos / multiprocess library, but for this kind of computation it is actually slower than a single process.
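(For reference, the joblib-based workaround I have seen suggested elsewhere, e.g. in the Stack Overflow result listed further down this page, looks roughly like the sketch below. It is untested; parallel_groupby_apply is just a name I made up, and func / df are the same placeholders as above. As far as I understand, joblib’s default loky backend serializes callables with cloudpickle, so local functions and lambdas usually work there, unlike with multiprocessing.Pool.)

from joblib import Parallel, delayed

import pandas as pd


def parallel_groupby_apply(df, by, func, n_jobs=-1):
    # Run func on every group in a separate worker (n_jobs=-1 uses all cores).
    results = Parallel(n_jobs=n_jobs)(
        delayed(func)(df_grp) for _, df_grp in df.groupby(by)
    )
    # Stitch the per-group results back together, as in the Pool version above.
    return pd.concat(results, axis=0)


# Roughly equivalent to df.groupby('Ticker').apply(func):
# df_result = parallel_groupby_apply(df, 'Ticker', func)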

So I am wondering if you are planning on supporting the above kind of parallelization for groupby().apply()?

Thanks!

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Reactions: 4
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
vnlitvinov commented, Mar 10, 2021

> I also couldn’t figure out how to revert to regular pandas just for this function and ended up disabling modin entirely for this workflow, which is a shame!

@jonschwenk you could convert from a Modin DataFrame to pandas and back using a pattern like this:

import modin.pandas as pd
from modin.utils import to_pandas
# do stuff
df = func_that_produces_modin_dataframe()
# do more stuff
...
# convert to pandas
pandas_df = to_pandas(df)
pandas_df.groupby('foo').apply(my_func)
# convert back to modin
df = pd.DataFrame(pandas_df)
# go on as usual using awesome Modin powers
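If you want to keep that round-trip in one place, a small helper along these lines might work (just a sketch, not a Modin API; groupby_apply_in_pandas is a made-up name, and note that to_pandas materializes the whole frame as a single in-memory pandas DataFrame):

import modin.pandas as pd
from modin.utils import to_pandas


def groupby_apply_in_pandas(modin_df, by, func):
    # Fall back to plain pandas just for the groupby().apply() step ...
    pandas_result = to_pandas(modin_df).groupby(by).apply(func)
    # ... then hand the result back to Modin for the rest of the pipeline.
    return pd.DataFrame(pandas_result)


# e.g. df = groupby_apply_in_pandas(df, 'foo', my_func)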
1 reaction
jonschwenk commented, Nov 20, 2020

I am also finding that I have to disable modin when using groupby() + apply(). In my case, I need to use different functions for different columns, so I place these lambda functions inside a custom function that is passed to apply(). This only works if I can access all columns of the dataframe; parallelizing over columns only really makes sense when the same function is used for every column. Perhaps a flag could catch these cases and revert to regular pandas? I also couldn’t figure out how to revert to regular pandas just for this function and ended up disabling modin entirely for this workflow, which is a shame!
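To make that concrete, the kind of pattern I mean looks roughly like this (the column names and lambdas are just illustrative placeholders; this is the plain-pandas version of the workflow I had to disable modin for):

import pandas as pd

# Different reductions per column, including lambdas.
col_funcs = {
    'Close': lambda s: s.iloc[-1],   # last price of the group
    'Volume': lambda s: s.sum(),     # total volume of the group
}


def summarize_group(df_grp):
    # The custom function needs access to all columns of the group,
    # not one column at a time.
    return pd.Series({col: f(df_grp[col]) for col, f in col_funcs.items()})


df_summary = df.groupby('Ticker').apply(summarize_group)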


Top Results From Across the Web

  • Parallelize apply after pandas groupby - Stack Overflow
    “This seems to work, although it really should be built in to pandas: import pandas as pd; from joblib import Parallel, delayed ...”
  • Hint at a better parallelization of groupby in Pandas
    “Parallelizing every group creates a chunk of data for each group. Each chunk needs to be transferred to cores in order to be ...”
  • Performs a Pandas groupby operation in parallel - gists · GitHub
    “Performs a Pandas groupby operation in parallel. Example usage: import pandas as pd; df = pd.DataFrame({'A ...”
  • Pandarallel — A simple and efficient tool to parallelize your ...
    “Parallel on 4 cores (lower is better). Except for df.groupby.col_name.rolling.apply, where speed increases only by a x3.2 factor ...”
  • PYTHON : Parallelize apply after pandas groupby - YouTube
