Understanding GroupBy.apply

Hello there. Firstly, thank you for such an amazing package that bridges the gap between pandas and PySpark. I started using Koalas about a week ago, and everything was intuitive until I stumbled upon Koalas GroupBy.apply.

Code:

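import databricks.koalas as ks

def _koalas_train(frame):
    # Sum the transaction values within one (div_nbr, store_nbr) group.
    out_frame = frame['trans_type_value'].sum()
    return out_frame

if __name__ == '__main__':
    # The function must be defined before it is passed to apply.
    ks_df = ks.DataFrame(features_data)
    ks_df_info_abt_train = ks_df.groupby(['div_nbr', 'store_nbr']).apply(_koalas_train)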

Here features_data is a pandas DataFrame (pd.DataFrame).

Output from Koalas GroupBy.apply:

[Screenshot: Screen Shot 2019-09-27 at 2 27 26 PM]

Output from pandas GroupBy.apply:

[Screenshot: Screen Shot 2019-09-27 at 2 30 13 PM]

As you can see, the output from pandas GroupBy.apply is as expected, but the output from Koalas GroupBy.apply is not right. Could you guide me in the right direction by pointing out any logical mistake I might have made, or anything else? Thank you once again.

Koalas version - 0.18.0
Pandas version - 0.23.4
PySpark version - 2.4.3
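
For reference, here is a minimal pandas-only sketch of the behaviour the question expects. The sample values in features_data are made up for illustration (the real data is not shown in the issue); only the column names and the function body come from the code above.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
features_data = pd.DataFrame({
    "div_nbr": [1, 1, 1, 2],
    "store_nbr": [10, 10, 11, 10],
    "trans_type_value": [3, 4, 5, 6],
})

def _koalas_train(frame):
    # Return one scalar per group: the sum of trans_type_value.
    return frame["trans_type_value"].sum()

# When the function returns a scalar, pandas combines the per-group results
# into a Series indexed by the grouping keys.
result = features_data.groupby(["div_nbr", "store_nbr"]).apply(_koalas_train)
print(result)
```

In plain pandas this should produce a Series with one row per (div_nbr, store_nbr) pair, which is the shape the Koalas output is being compared against.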

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 29 (19 by maintainers)

Top GitHub Comments

2 reactions
HyukjinKwon commented, Oct 4, 2019

Thank YOU guys 😃

2 reactions
HyukjinKwon commented, Oct 2, 2019

BTW, this should be safe to do, because func in spark_df.groupby(...).apply(func) is guaranteed to receive the whole data of each group.

Say, if we have a Spark DataFrame as below:

>>> df.show()
+---+---+
|  a|  b|
+---+---+
|  1|  1|
|  1|  2|
|  2|  2|
|  2|  3|
+---+---+

from pyspark.sql.functions import PandasUDFType, pandas_udf

@pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
def func(pdf):
    print(pdf)
    return pdf

df.groupby("a").apply(func).show()

Inside func, pdf holds the rows of one group of a at a time:

   a  b
0  1  1
1  1  2
   a  b
0  2  2
1  2  3

So, applying pdf.groupby(...).apply(...) inside func won’t change its output (although the additional pandas groupby wastes a bit of computation); it only corrects the index to match pandas’ style:

from pyspark.sql.functions import PandasUDFType, pandas_udf

@pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
def func(pdf):
    pdf = pdf.groupby("a").apply(lambda pdf: pdf)
    print(pdf)
    return pdf

df.groupby("a").apply(func).show()

   a  b
0  1  1
1  1  2
   a  b
0  2  2
1  2  3
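
The same point can be checked without Spark. The sketch below is pandas-only and is an illustration, not the Koalas implementation: the loop stands in for Spark handing each whole group to func, and an aggregation (a sum over b) is used instead of the identity lambda so the index effect is visible. Names such as results are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2, 2], "b": [1, 2, 2, 3]})

results = {}
# Stand-in for Spark's grouped map: each iteration sees one whole group.
for key, pdf in df.groupby("a"):
    pdf = pdf.reset_index(drop=True)

    # Computing directly on the group gives a bare scalar.
    direct = pdf["b"].sum()

    # Regrouping by the same key finds exactly one group, so the value is
    # unchanged, but the result now carries a pandas-style index named "a".
    regrouped = pdf.groupby("a")["b"].apply(lambda s: s.sum())

    results[int(key)] = (direct, regrouped)

for key, (direct, regrouped) in results.items():
    print(key, direct)
    print(regrouped)
```

Because each pdf already contains a single value of a, the inner groupby is redundant as far as the values go; its only effect is on the index of the result.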