Understanding GroupBy.apply

See original GitHub issue

Hello there. Firstly, thank you for such an amazing package that bridges the gap between pandas and PySpark. I started using Koalas about a week ago, and everything was intuitive until I stumbled upon Koalas GroupBy.apply.

Code:

import databricks.koalas as ks

def _koalas_train(frame):
    # Sum the 'trans_type_value' column for one (div_nbr, store_nbr) group.
    return frame['trans_type_value'].sum()

if __name__ == '__main__':
    # features_data is a pandas DataFrame created elsewhere.
    ks_df = ks.DataFrame(features_data)
    ks_df_info_abt_train = ks_df.groupby(['div_nbr', 'store_nbr']).apply(_koalas_train)

Here features_data is a pandas DataFrame (pd.DataFrame).

Output from Koalas GroupBy.apply: [screenshot: Screen Shot 2019-09-27 at 2 27 26 PM]

Output from pandas GroupBy.apply: [screenshot: Screen Shot 2019-09-27 at 2 30 13 PM]

As you can see, the output from pandas GroupBy.apply is as expected, but the output from Koalas GroupBy.apply is not right. Could you guide me in the right direction by pointing out any logical mistake I might have made, or anything else? Thank you once again.

Koalas version - 0.18.0
Pandas version - 0.23.4
PySpark version - 2.4.3
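For reference, here is a minimal pandas-only sketch of the behaviour the question treats as expected. The data values and the helper name `_pandas_train` are hypothetical, made up for illustration; only the column names come from the question. It assumes the desired result is one sum of trans_type_value per (div_nbr, store_nbr) group.

```python
import pandas as pd

# Hypothetical stand-in for features_data; only the column names
# are taken from the question, the values are invented.
features_data = pd.DataFrame({
    "div_nbr": [1, 1, 1, 2],
    "store_nbr": [10, 10, 20, 30],
    "trans_type_value": [2, 3, 4, 5],
})

def _pandas_train(frame):
    # Called once per group; returns a single scalar for that group.
    return frame["trans_type_value"].sum()

# Because the function returns a scalar, pandas assembles a Series
# indexed by the group keys (div_nbr, store_nbr).
result = features_data.groupby(["div_nbr", "store_nbr"]).apply(_pandas_train)
print(result)
```

In pandas, the group keys end up in the index of the result, which is the part that the rest of this thread is concerned with reproducing in Koalas.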

Issue Analytics

State:
Created 4 years ago
Comments: 29 (19 by maintainers)

Top GitHub Comments

2 reactions

HyukjinKwon commented, Oct 4, 2019

Thank YOU guys 😃

2 reactions

HyukjinKwon commented, Oct 2, 2019

BTW, this should be safe to do, because spark_df.groupby(...).apply(func) guarantees that func receives the whole data of each group at once.

Say we have a Spark DataFrame as below:

>>> df.show()

+---+---+
|  a|  b|
+---+---+
|  1|  1|
|  1|  2|
|  2|  2|
|  2|  3|
+---+---+

from pyspark.sql.functions import PandasUDFType, pandas_udf

@pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
def func(pdf):
    print(pdf)
    return pdf

df.groupby("a").apply(func).show()

Inside func, pdf is then one pandas DataFrame per group of a:

   a  b
0  1  1
1  1  2

   a  b
0  2  2
1  2  3

So applying pdf.groupby(...).apply(...) inside func won’t change the output values (although the additional groupby in pandas wastes a bit of computation); it only corrects the index to match pandas’ style:

from pyspark.sql.functions import PandasUDFType, pandas_udf

@pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
def func(pdf):
    pdf = pdf.groupby("a").apply(lambda pdf: pdf)
    print(pdf)
    return pdf

df.groupby("a").apply(func).show()
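The argument above can be checked without Spark. The following is a rough pandas-only sketch that mimics, under stated assumptions, how a grouped-map function sees the data: each chunk holds every row of exactly one group. The chunking loop and the names `chunks`, `per_chunk`, `combined` and `expected` are illustrative and not part of Koalas or PySpark; a sum over b stands in for an arbitrary per-group function.

```python
import pandas as pd

# Same data as the Spark DataFrame shown above.
df = pd.DataFrame({"a": [1, 1, 2, 2], "b": [1, 2, 2, 3]})

# Assumption: emulate the grouped-map contract by handing out one
# pandas DataFrame per distinct value of "a".
chunks = [pdf for _, pdf in df.groupby("a")]

# Re-grouping inside each chunk finds exactly one group, so the values
# are unaffected; the result just gains the group key as its index.
per_chunk = [pdf.groupby("a")["b"].sum() for pdf in chunks]
combined = pd.concat(per_chunk)

# Grouping the whole frame in plain pandas gives the same answer.
expected = df.groupby("a")["b"].sum()

print(combined.equals(expected))
```

The extra groupby inside each chunk is redundant work, as noted above, but it lets pandas build the index the same way it would for an ordinary pandas groupby.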