groupby-agg all columns based on one column
Hello, I have a dataset that looks roughly like this:
# category date x y
0 c 2020-01-20 0.6985333564957753 61
1 b 2020-01-19 0.011782836532168281 63
2 c 2020-01-18 -1.5929651389254533 16
3 c 2020-01-17 -1.614997608433403 97
4 b 2020-01-16 0.5646888996607152 23
... ... ... ... ...
15 c 2020-01-05 -0.3777229791754252 55
16 b 2020-01-04 -1.8989497908039141 44
17 c 2020-01-03 0.5762175737242692 99
18 c 2020-01-02 1.4724334531958192 22
19 a 2020-01-01 1.1491668587221784 2
I need to group by category and then select the values for x and y that correspond to the maximum date within each group. With Pandas, I would do:
df.loc[df.groupby('category')['date'].idxmax()]
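For reference, here is a minimal, self-contained sketch of the pandas approach on a tiny made-up frame (the values are illustrative, not the real dataset). It assumes the default integer index, so the labels returned by idxmax can be passed straight to .loc:

```python
import pandas as pd

# Small made-up frame with the same columns as the dataset above
df = pd.DataFrame({
    "category": ["c", "b", "c", "a", "b"],
    "date": pd.to_datetime(
        ["2020-01-20", "2020-01-19", "2020-01-18", "2020-01-01", "2020-01-16"]
    ),
    "x": [0.70, 0.01, -1.59, 1.15, 0.56],
    "y": [61, 63, 16, 2, 23],
})

# For each category, idxmax returns the index label of the row with the latest date
latest_idx = df.groupby("category")["date"].idxmax()

# .loc turns those labels back into full rows (one row per category)
latest = df.loc[latest_idx].reset_index(drop=True)
print(latest)
```

Note that plain df[...] with a list of labels would be interpreted as a column selection, which is why .loc is used here.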
Is there a way to achieve this using Vaex? I was able to select the minimum date within each group using:
import vaex as vx

# df is the Vaex DataFrame shown above
df.groupby('category').agg({'x': vx.agg.first('x', 'date'),
                            'y': vx.agg.first('y', 'date')})
The documentation for vaex.agg.first doesn't indicate exactly how to use the order_expression parameter, so I wasn't sure if there is a way to reverse-order the date expression. Is this possible?
Thank you
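The idea behind reversing the order can be sketched independently of Vaex: taking the first row under a negated ordering key is the same as taking the last row under the original key. The snippet below illustrates this in pandas only; it is not the Vaex API, and the sample data and variable names are made up for the example. It assumes the dates can be converted to integer nanoseconds so that they can be negated.

```python
import pandas as pd

# Made-up sample with the same columns as the question
df = pd.DataFrame({
    "category": ["c", "b", "c", "a", "b"],
    "date": pd.to_datetime(
        ["2020-01-20", "2020-01-19", "2020-01-18", "2020-01-01", "2020-01-16"]
    ),
    "x": [0.70, 0.01, -1.59, 1.15, 0.56],
    "y": [61, 63, 16, 2, 23],
})

# Convert dates to integer nanoseconds and negate: the newest date
# now has the smallest key within its group
neg_order = -df["date"].astype("int64")

# The minimum of the negated key per category points at the newest row
newest_idx = neg_order.groupby(df["category"]).idxmin()
newest = df.loc[newest_idx, ["category", "x", "y"]].reset_index(drop=True)

# Same rows as asking for the maximum date directly
direct_idx = df.groupby("category")["date"].idxmax()
print(newest)
```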
Issue Analytics
- State:
- Created 3 years ago
- Reactions:1
- Comments:9 (4 by maintainers)
I am not sure how constructive this message is, but please be aware that @maartenbreddels has opened a PR for this, #1848. He is asking for help testing it, which I have unfortunately not been able to provide. If you would like to give it a try, that may speed things up.
Best,
Hi Maarten, Thanks for your help. Your suggestion worked with some changes. First, I had to enclose the entire order_expression in quotes. Otherwise, I get: TypeError: unhashable type: 'Expression'. Also, it seems that the datatype conversion used on the date column must match the datatype of the column that you're selecting from, so float for column x and int for column y, in this case. Here is what worked:
Alternatively, this also worked to accomplish the same thing:
Obviously, it’s not as concise, but I found it to still be much faster than using Pandas. Thanks
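The snippets from this comment are not preserved here. Purely as an illustration, and in pandas rather than Vaex, one less concise two-step route to the same result is to compute the maximum date per category and then join it back onto the original rows. This is an assumed sketch on made-up data, not the code from the comment:

```python
import pandas as pd

# Made-up sample with the same columns as the question
df = pd.DataFrame({
    "category": ["c", "b", "c", "a", "b"],
    "date": pd.to_datetime(
        ["2020-01-20", "2020-01-19", "2020-01-18", "2020-01-01", "2020-01-16"]
    ),
    "x": [0.70, 0.01, -1.59, 1.15, 0.56],
    "y": [61, 63, 16, 2, 23],
})

# Step 1: one row per category holding its latest date
max_dates = df.groupby("category", as_index=False)["date"].max()

# Step 2: an inner join on (category, date) keeps only the rows at that date
latest = (
    df.merge(max_dates, on=["category", "date"])
      .sort_values("category")
      .reset_index(drop=True)
)
print(latest)
```

Unlike the idxmax approach, this keeps every row that ties for the latest date within a category.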