How is `groupby().apply()` parallelized?
(I see you want “general questions” written to a mailing-list, but I don’t know how that works, so I hope it’s OK to write my question here. I apologize for any inconvenience.)
It’s a very interesting project you have made! I am currently testing to see if I can use it as the backend in a new financial data API I am working on, because I have some quite slow functions that group data by stock-tickers and perform calculations for each ticker individually.
I am particularly interested in parallelizing computations of the form `df.groupby('Ticker').apply(func)`, but currently Modin seems to take the same time as normal Pandas for this kind of computation.
Here’s an example of how I would parallelize it using Python `multiprocessing.Pool`:
```python
import multiprocessing

import pandas as pd

# Apply this function for each stock individually.
def func(df_grp):
    # Do something ...
    return df_grp_result

# Split original DataFrame into sub-groups.
groups_iter = df.groupby('Ticker')
tickers, df_groups = zip(*groups_iter)

# Parallel processing for each DataFrame sub-group.
pool = multiprocessing.Pool()
result = pool.map(func, df_groups)
pool.close()
pool.join()

# Combine results back into a single DataFrame.
df_result = pd.concat(result, axis=0)
```
But I can’t actually implement it like this, because `multiprocessing.Pool` has several limitations that don’t work so well with how the rest of my code-base has been structured (e.g. `func` cannot be a local function, or a lambda-function, etc.). I have also tried the pathos / multiprocess library, but it’s actually slower than a single process for this kind of computation.
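For context, the restriction on lambdas and local functions comes from how `multiprocessing.Pool` hands work to its worker processes: the callable is serialized with `pickle`, which stores a function by its importable name rather than by its code. A minimal sketch that checks this directly, using only the standard library (`can_pickle` and `make_local_func` are hypothetical helpers written for this illustration):

```python
import pickle


def can_pickle(obj):
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


def make_local_func():
    # A function defined inside another function has no importable name.
    def local_func(x):
        return x
    return local_func


print(can_pickle(len))                # a built-in, looked up by name
print(can_pickle(lambda x: x))        # a lambda
print(can_pickle(make_local_func()))  # a local (nested) function
```

Only the first call should report `True`. The same check can be applied to any `func` before passing it to `pool.map`.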
So I am wondering if you are planning on supporting the above kind of parallelization for `groupby().apply()`?
Thanks!
Top GitHub Comments
@jonschwenk you could convert from Modin DataFrame to Pandas and back using patterns like this:
I am also finding that I have to disable Modin when using groupby() + apply(). In my case, I need to use different functions to group different columns, so I place these lambda functions in a custom function that is passed to apply(). This only works if I can access all columns of the DataFrame; parallelizing over columns seems to make sense only if the same function is used for all columns. Perhaps a flag could catch these cases and revert to regular pandas? I also couldn’t figure out how to revert to regular pandas just for this function, so I ended up disabling Modin entirely for this workflow, which is a shame!
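One way to revert to plain pandas for a single step could look like the sketch below. It rests on assumptions: `run_in_pandas` is a hypothetical helper, not part of Modin; it relies on Modin DataFrames exposing a `_to_pandas()` method (a private API that may change between versions), and on `modin.pandas.DataFrame` accepting a pandas DataFrame in its constructor for the conversion back.

```python
def run_in_pandas(df, step, rewrap=None):
    """Run `step` on a plain pandas object, then optionally wrap the result.

    Hypothetical helper. Assumes a Modin DataFrame exposes `_to_pandas()`
    (private API); any object without that method is passed through as-is.
    """
    to_pandas = getattr(df, "_to_pandas", None)
    plain = to_pandas() if callable(to_pandas) else df
    result = step(plain)
    # `rewrap` would typically be modin.pandas.DataFrame (an assumption).
    return rewrap(result) if rewrap is not None else result
```

With Modin imported as `mpd`, a call might look like `run_in_pandas(mdf, lambda d: d.groupby('Ticker').apply(func), rewrap=mpd.DataFrame)`. The conversion copies the data into a single process, so this only helps when the rest of the workflow benefits from Modin.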