How is `groupby().apply()` parallelized?

(I see you want “general questions” written to a mailing-list, but I don’t know how that works, so I hope it’s OK to write my question here. I apologize for any inconvenience.)

It’s a very interesting project you have made! I am currently testing to see if I can use it as the backend in a new financial data API I am working on, because I have some quite slow functions that group data by stock-tickers and perform calculations for each ticker individually.

I am particularly interested in parallelizing computations of the form df.groupby('Ticker').apply(func), but currently Modin seems to take the same time as normal Pandas for this kind of computation.

Here’s an example of how I would parallelize it using Python multiprocessing.Pool:

import multiprocessing

import pandas as pd

# Apply this function for each stock individually.
def func(df_grp):
    # Do something ...
    return df_grp_result


# Split original DataFrame into sub-groups.
groups_iter = df.groupby('Ticker')
tickers, df_groups = zip(*groups_iter)

# Parallel processing for each DataFrame sub-group.
pool = multiprocessing.Pool()
result = pool.map(func, df_groups)
pool.close()
pool.join()

# Combine results back into a single DataFrame.
df_result = pd.concat(result, axis=0)

But I can’t actually implement it like this, because multiprocessing.Pool has several limitations that don’t work well with how the rest of my code-base is structured (e.g. func cannot be a local function or a lambda function). I have also tried the pathos / multiprocess library, but it’s actually slower than a single process for this kind of computation.
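The restriction on local functions and lambdas comes from pickling: multiprocessing.Pool sends the function to its worker processes by pickling it, and the standard pickle module serializes functions by reference to their qualified name. The following is a minimal sketch that checks this without starting any processes; the helper names can_pickle and make_local_func are made up for illustration and are not part of any library.

```python
import pickle


def can_pickle(obj):
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        # The exact exception type depends on the Python version.
        return False


def make_local_func():
    # Hypothetical stand-in for a per-group function defined inside
    # another function (a "local function").
    def local_func(df_grp):
        return df_grp

    return local_func


# Lambdas and local functions have no importable qualified name,
# so pickle cannot look them up again on the worker side.
lambda_ok = can_pickle(lambda df_grp: df_grp)
local_ok = can_pickle(make_local_func())

# A function reachable by name (here the builtin len) pickles fine.
builtin_ok = can_pickle(len)

print(lambda_ok, local_ok, builtin_ok)
```

Libraries such as pathos / multiprocess get around this by using a different serializer (dill), which may be one reason for the extra overhead mentioned above, though that is only a guess.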

So I am wondering if you are planning on supporting the above kind of parallelization for groupby().apply()?

Thanks!

Issue Analytics

Created: 4 years ago
Reactions: 4
Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction

vnlitvinov commented, Mar 10, 2021

> I also couldn’t figure out how to revert to regular pandas just for this function and ended up disabling modin entirely for this workflow, which is a shame!

@jonschwenk you could convert from Modin DataFrame to Pandas and back using patterns like this:

import modin.pandas as pd
from modin.utils import to_pandas
# do stuff
df = func_that_produces_modin_dataframe()
# do more stuff
...
# convert to pandas
pandas_df = to_pandas(df)
pandas_df.groupby('foo').apply(my_func)
# convert back to modin
df = pd.DataFrame(pandas_df)
# go on as usual using awesome Modin powers

1 reaction

jonschwenk commented, Nov 20, 2020

I am also finding that I have to disable modin when using groupby() + apply(). In my case, I need to apply different functions to different columns within each group, so I place these lambda functions in a custom function that is passed to apply(). This only works if the custom function can access all columns of the dataframe; parallelizing over columns seems to make sense only if the same function is used for all columns. Perhaps a flag could catch these cases and revert to regular pandas? I also couldn’t figure out how to revert to regular pandas just for this function and ended up disabling modin entirely for this workflow, which is a shame!
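For readers trying to picture the pattern described in this comment, here is a rough sketch in plain pandas of a custom function passed to apply() that treats each column differently and needs several columns at once. The column names, the data, and the summarize function are all hypothetical; the commenter's actual code is not shown in the thread.

```python
import pandas as pd


def summarize(group):
    # Each output uses a different function, and the last one needs
    # two columns at once, so the group cannot be split by column.
    return pd.Series({
        "price_mean": group["price"].mean(),
        "volume_sum": group["volume"].sum(),
        "vwap": (group["price"] * group["volume"]).sum() / group["volume"].sum(),
    })


# Hypothetical example data.
df = pd.DataFrame({
    "ticker": ["A", "A", "B", "B"],
    "price": [10.0, 12.0, 20.0, 30.0],
    "volume": [100, 300, 50, 150],
})

# Select the value columns explicitly so the grouping column is not
# passed into the custom function.
result = df.groupby("ticker")[["price", "volume"]].apply(summarize)
print(result)
```

Because the volume-weighted price depends on both price and volume, the work can only be split across groups (rows), not across columns, which matches the concern raised above.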