Understanding GroupBy.apply
Hello there. Firstly, thank you for such an amazing package that bridges the gap between pandas and PySpark.
I started using Koalas approximately one week ago, and everything was intuitive until I stumbled upon Koalas GroupBy.apply.
Code:
    import databricks.koalas as ks

    def _koalas_train(frame):
        # Sum trans_type_value for one (div_nbr, store_nbr) group.
        out_frame = frame['trans_type_value'].sum()
        return out_frame

    if __name__ == '__main__':
        # features_data is a pandas DataFrame defined elsewhere.
        ks_df = ks.DataFrame(features_data)
        ks_df_info_abt_train = ks_df.groupby(['div_nbr', 'store_nbr']).apply(_koalas_train)
Here, features_data is a pandas DataFrame (pd.DataFrame).
Output from Koalas.Groupby.Apply:
Output from Pandas.Groupby.Apply:
As you can see, the output from pandas GroupBy.apply is as expected, but the output from Koalas GroupBy.apply is not right. Could you guide me in the right direction by pointing out any logical mistake I might have made, or anything else? Thank you once again.
Koalas version: 0.18.0, pandas version: 0.23.4, PySpark: 2.4.3
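For reference, here is a minimal pandas-only sketch of the pattern in the question. The sample rows are made up for illustration, because the original features_data and both outputs are not shown. It assumes the function returns one scalar per group, in which case pandas builds a Series indexed by the group keys.

```python
import pandas as pd

# Hypothetical sample data; the real features_data is not shown in the issue.
features_data = pd.DataFrame({
    "div_nbr": [1, 1, 2],
    "store_nbr": [10, 10, 20],
    "trans_type_value": [1.0, 2.0, 3.0],
})

def train(frame):
    # Return one scalar per group: the sum of trans_type_value.
    return frame["trans_type_value"].sum()

# Newer pandas versions may emit a deprecation warning here about the
# grouping columns being passed to the applied function.
applied = features_data.groupby(["div_nbr", "store_nbr"]).apply(train)

# The same per-group sum written as a plain aggregation.
aggregated = features_data.groupby(["div_nbr", "store_nbr"])["trans_type_value"].sum()

print(applied)
print(aggregated)
```

When the per-group function is a simple reduction like this, the plain aggregation form avoids apply entirely. That may be worth trying in Koalas as well.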
Issue Analytics
- State:
- Created 4 years ago
- Comments:29 (19 by maintainers)
Top GitHub Comments
Thank YOU guys 😃
BTW, it should be safe to do because func in spark_df.groupby(...).apply(func) is guaranteed to receive the same whole grouped data.
Say, if we have a Spark DataFrame as below:
The pdf passed to func becomes, each grouped by a:
So, applying pdf.groupby(...).apply(...) won't change its output (although it wastes a bit of computation on the additional groupby in pandas) but only corrects the index to pandas' style:
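The original example output is not preserved here. As a rough pandas-only sketch of the point being made, assume func receives a pandas DataFrame holding every row of a single group. The column names a and b, the values, and the helper inner are hypothetical. Grouping that frame again by the same key produces the same value, and only adds the group key as the index.

```python
import pandas as pd

def inner(pdf):
    # Hypothetical per-group computation: sum column b.
    return pdf["b"].sum()

# Stand-in for what func would receive: all rows of the group a == 1.
group_pdf = pd.DataFrame({"a": [1, 1], "b": [10, 20]})

# Calling the computation directly yields a bare scalar.
direct = inner(group_pdf)

# Grouping again inside the single-group frame yields the same value,
# but as a Series whose index carries the group key, pandas-style.
# Newer pandas versions may emit a deprecation warning about grouping columns.
regrouped = group_pdf.groupby("a").apply(inner)

print(direct)
print(regrouped)
```

The extra groupby is redundant work, since the frame already contains exactly one group, but it is a cheap way to get the pandas-style index.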