DataFrame.GroupBy.apply()

I’m continuously missing this behavior when converting my pandas scripts to koalas:

import pandas as pd

df = pd.DataFrame({
    "timestamp": [0.0, 0.5, 1.0, 0.0, 0.5],
    "car_id": ['A', 'A', 'A', 'B', 'B'],
    "battery_charge": [100, 90, 80, 100, 90],
})

print(df)

def calc_battery_usage(x):
    # Sort the group by time once, then compare its first and last readings.
    ordered = x.sort_values('timestamp')
    start_bat = ordered.iloc[0]['battery_charge']
    stop_bat = ordered.iloc[-1]['battery_charge']
    return start_bat - stop_bat

print(df.groupby('car_id').apply(calc_battery_usage))

That prints out:

   timestamp car_id  battery_charge
0        0.0      A             100
1        0.5      A              90
2        1.0      A              80
3        0.0      B             100
4        0.5      B              90

car_id
A    20
B    10

On koalas I just get:

PandasNotImplementedError: The method `pd.groupby.GroupBy.apply()` is not implemented yet.

I can take over and start implementing it myself, but I would probably need some general guidelines on how to do it (probably using PySpark's pandas_udf).

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 12 (11 by maintainers)

Top GitHub Comments

1 reaction
icexelloss commented, Jun 3, 2019

Hi @patryk-oleniuk! I think it’d be great to have groupby apply in koalas.

Spark’s groupby apply and pandas’ groupby apply are pretty similar. However, the two main differences that I can think of are:

  • Spark’s groupby apply needs a full return type specification on the udf definition.
  • Spark’s groupby apply doesn’t keep an index, while pandas’ groupby apply uses the grouping key as the index.
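
The index difference can be seen with pandas alone. Here is a minimal sketch reusing the data from the question: the pandas result is indexed by the grouping key, whereas an index-free result of the kind Spark returns has to carry the key as an ordinary column. The column name `battery_usage` is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": [0.0, 0.5, 1.0, 0.0, 0.5],
    "car_id": ["A", "A", "A", "B", "B"],
    "battery_charge": [100, 90, 80, 100, 90],
})

def calc_battery_usage(x):
    # First reading minus last reading, ordered by time within the group.
    ordered = x.sort_values("timestamp")
    return ordered["battery_charge"].iloc[0] - ordered["battery_charge"].iloc[-1]

# pandas: the grouping key (car_id) becomes the index of the result.
by_index = df.groupby("car_id")[["timestamp", "battery_charge"]].apply(calc_battery_usage)

# Index-free shape: the key is moved into a regular column next to the value.
flat = by_index.rename("battery_usage").reset_index()
```

Here `by_index` is a Series keyed by `car_id`, while `flat` is a two-column table, which is closer to what a Spark-backed implementation would have to map back onto an index.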

As a start, I’d probably try something like:

# Proposed usage only: full_return_schema is a placeholder for the full return type specification.
@pandas_wraps(full_return_schema)
def calc_battery_usage(x):
    start_bat = x.sort_values('timestamp').iloc[0]['battery_charge']
    stop_bat = x.sort_values('timestamp').iloc[-1]['battery_charge']
    return start_bat - stop_bat

kdf.groupby('car_id').apply(calc_battery_usage)

And then just wrap it with Spark’s groupby apply + pandas_udf.
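
A rough sketch of what that wrapping could look like in plain PySpark, assuming Spark 3.0 or later, where `GroupedData.applyInPandas` is available (on Spark 2.x the equivalent was a `pandas_udf` with `PandasUDFType.GROUPED_MAP`). This is not the Koalas implementation. The helper names `usage_per_car` and `battery_usage_on_spark`, the column name `battery_usage`, and the schema string are all assumptions made for illustration.

```python
import pandas as pd

# Assumed output schema: Spark needs the full return type declared up front.
RESULT_SCHEMA = "car_id string, battery_usage long"

def usage_per_car(pdf: pd.DataFrame) -> pd.DataFrame:
    """Receive one car_id group as a pandas DataFrame, return a one-row DataFrame."""
    ordered = pdf.sort_values("timestamp")
    usage = ordered["battery_charge"].iloc[0] - ordered["battery_charge"].iloc[-1]
    # The grouping key is returned as a normal column because Spark keeps no index.
    return pd.DataFrame({
        "car_id": [ordered["car_id"].iloc[0]],
        "battery_usage": [usage],
    })

def battery_usage_on_spark(sdf):
    """Apply usage_per_car to every car_id group of a Spark DataFrame (hypothetical helper)."""
    return sdf.groupBy("car_id").applyInPandas(usage_per_car, schema=RESULT_SCHEMA)
```

With a Spark session at hand, something like `battery_usage_on_spark(spark.createDataFrame(df))` should give one row per car. The per-group function is ordinary pandas, so it can be checked without a cluster before it is handed to Spark.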

Does that make sense?

0 reactions
devarshml commented, Sep 30, 2019

Thank you guys for the replies. I will surely check out all of the suggestions by tomorrow evening…!!
