groupby-agg all columns based on one column

Hello, I have a dataset that looks roughly like this:

#    category    date        x                     y
0    c           2020-01-20  0.6985333564957753    61
1    b           2020-01-19  0.011782836532168281  63
2    c           2020-01-18  -1.5929651389254533   16
3    c           2020-01-17  -1.614997608433403    97
4    b           2020-01-16  0.5646888996607152    23
...  ...         ...         ...                   ...
15   c           2020-01-05  -0.3777229791754252   55
16   b           2020-01-04  -1.8989497908039141   44
17   c           2020-01-03  0.5762175737242692    99
18   c           2020-01-02  1.4724334531958192    22
19   a           2020-01-01  1.1491668587221784    2

I need to group by category and then select the values for x and y that correspond to the maximum date within each group. With Pandas, I would do:

df.loc[df.groupby('category')['date'].idxmax()]
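
For reference, here is a minimal, self-contained sketch of that pandas idiom. The four-row frame is made-up sample data (values rounded from the table above), and `.loc` is used because `idxmax` returns index labels rather than a boolean mask:

```python
import pandas as pd

# Tiny illustrative frame; the rows are sample data, not the full dataset.
df = pd.DataFrame({
    "category": ["c", "b", "c", "a"],
    "date": pd.to_datetime(["2020-01-20", "2020-01-19", "2020-01-18", "2020-01-01"]),
    "x": [0.70, 0.01, -1.59, 1.15],
    "y": [61, 63, 16, 2],
})

# idxmax gives, per category, the index label of the row with the latest date.
latest_idx = df.groupby("category")["date"].idxmax()

# Selecting those labels returns one full row (x, y included) per category.
latest = df.loc[latest_idx]
print(latest)
```

Note that `idxmax` keeps only one row per group, even when several rows share the maximum date.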

Is there a way to achieve this using Vaex? I was able to select the minimum date within each group using:

# vx is assumed to be the vaex module (import vaex as vx)
df.groupby('category').agg({'x': vx.agg.first('x', 'date'),
                            'y': vx.agg.first('y', 'date')})

The documentation for vaex.agg.first doesn’t indicate exactly how to use the order_expression parameter so I wasn’t sure if there is a way to reverse-order the date expression. Is this possible?

Thank you

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 9 (4 by maintainers)

Top GitHub Comments

1 reaction
yohplala commented, Feb 20, 2022

Yes, we definitely need ‘last’. It is actually pretty important.

I am not sure this message is very constructive, but please be aware that @maartenbreddels has opened a PR for this, #1848. He is asking for some testing, which I have unfortunately not been able to provide. If you would like to give it a try, this may speed things up.

Bests,

1 reaction
markbarna commented, Jun 17, 2020

Hi Maarten, thanks for your help. Your suggestion worked with some changes. First, I had to enclose the entire order_expression in quotes; otherwise, I get: TypeError: unhashable type: 'Expression'. Second, it seems that the datatype the date column is cast to must match the datatype of the column you're selecting from: float for column x and int for column y, in this case. Here is what worked:

dfg = df.groupby('category').agg({'x': vx.agg.first('x', '-date.astype("float")'),
                                  'y': vx.agg.first('y', '-date.astype("int")')})

Alternatively, this also worked to accomplish the same thing:

dfg = df.groupby('category').agg({'date': 'max'})
df = df.join(dfg, on='category', rsuffix='_max')
df = df[df['date'] == df['date_max']].drop(['date_max', 'category_max'])

Obviously, it’s not as concise, but I found it to still be much faster than using Pandas. Thanks
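
The aggregate, join, and filter pattern above can be sketched on a tiny frame. This is a pandas analogue written only to illustrate the logic, not the Vaex code itself; the sample rows and the names `date_max`, `joined`, and `latest_rows` are made up for the example:

```python
import pandas as pd

# Illustrative data; category "b" has two rows tied on the latest date.
df = pd.DataFrame({
    "category": ["c", "c", "b", "b", "a"],
    "date": pd.to_datetime(
        ["2020-01-20", "2020-01-18", "2020-01-19", "2020-01-19", "2020-01-01"]
    ),
    "y": [61, 16, 63, 23, 2],
})

# Step 1: maximum date per category.
date_max = df.groupby("category", as_index=False).agg(date_max=("date", "max"))

# Step 2: attach each group's maximum date to every row of that group.
joined = df.merge(date_max, on="category")

# Step 3: keep only rows whose date equals the group maximum, then drop the helper column.
latest_rows = joined[joined["date"] == joined["date_max"]].drop(columns="date_max")
print(latest_rows)
```

One difference from an `idxmax`-style selection is worth keeping in mind: filtering on equality keeps every row tied at the maximum date, so a group can contribute more than one row.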
