groupby-agg all columns based on one column

Hello, I have a dataset that looks roughly like this:

#    category    date        x                     y
0    c           2020-01-20  0.6985333564957753    61
1    b           2020-01-19  0.011782836532168281  63
2    c           2020-01-18  -1.5929651389254533   16
3    c           2020-01-17  -1.614997608433403    97
4    b           2020-01-16  0.5646888996607152    23
...  ...         ...         ...                   ...
15   c           2020-01-05  -0.3777229791754252   55
16   b           2020-01-04  -1.8989497908039141   44
17   c           2020-01-03  0.5762175737242692    99
18   c           2020-01-02  1.4724334531958192    22
19   a           2020-01-01  1.1491668587221784    2

I need to group by category and then select the values for x and y that correspond to the maximum date within each group. With Pandas, I would do:

df.loc[df.groupby('category')['date'].idxmax()]
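
For reference, here is a minimal runnable sketch of that pandas approach. The five-row frame is a made-up miniature of the dataset above, and the sketch assumes the date column has a datetime dtype. Note that the index labels returned by idxmax() are passed to .loc, which selects rows by label.

```python
import pandas as pd

# Hypothetical miniature of the dataset above (values rounded, for illustration only).
df = pd.DataFrame({
    "category": ["c", "b", "c", "b", "a"],
    "date": pd.to_datetime(
        ["2020-01-20", "2020-01-19", "2020-01-18", "2020-01-16", "2020-01-01"]
    ),
    "x": [0.70, 0.01, -1.59, 0.56, 1.15],
    "y": [61, 63, 16, 23, 2],
})

# For each category, idxmax() gives the index label of the row with the latest date.
latest_idx = df.groupby("category")["date"].idxmax()

# .loc selects whole rows by those labels, so x and y come along with the date.
latest = df.loc[latest_idx].reset_index(drop=True)
print(latest)
```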

Is there a way to achieve this using Vaex? I was able to select the minimum date within each group using:

df.groupby('category').agg({'x': vx.agg.first('x', 'date'),
                            'y': vx.agg.first('y', 'date')})

The documentation for vaex.agg.first doesn’t indicate exactly how to use the order_expression parameter so I wasn’t sure if there is a way to reverse-order the date expression. Is this possible?

Thank you

Top GitHub Comments

yohplala commented, Feb 20, 2022

Yes, we definitely need ‘last’. It is actually pretty important.

This may not be the most constructive comment, but please be aware that @maartenbreddels has opened a PR for this, #1848. He is asking for some testing, which I have unfortunately not been able to provide. If you would like to give it a try, that may speed things up.

Bests,

markbarna commented, Jun 17, 2020

Hi Maarten, thanks for your help. Your suggestion worked with some changes. First, I had to enclose the entire order_expression in quotes; otherwise, I get: TypeError: unhashable type: 'Expression'. Second, it seems that the datatype the date column is cast to must match the datatype of the column you're selecting from: float for column x and int for column y, in this case. Here is what worked:

dfg = df.groupby('category').agg({'x': vx.agg.first('x', '-date.astype("float")'),
                                  'y': vx.agg.first('y', '-date.astype("int")')})
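
To show why negating the order expression turns "first" into "last", here is a small pure-Python sketch of the idea. It is not Vaex's implementation: first_by is a hypothetical helper and the rows are made up. It also hints at why a cast is needed, since a date has to become a number before it can be negated.

```python
from datetime import date

# Hypothetical rows: (category, date, y). Values are for illustration only.
rows = [
    ("c", date(2020, 1, 20), 61),
    ("b", date(2020, 1, 19), 63),
    ("c", date(2020, 1, 18), 16),
    ("b", date(2020, 1, 16), 23),
    ("a", date(2020, 1, 1), 2),
]

def first_by(rows, key):
    """Per category, keep the y value of the row with the smallest order key."""
    best = {}
    for category, d, y in rows:
        k = key(d)
        if category not in best or k < best[category][0]:
            best[category] = (k, y)
    return {category: y for category, (_, y) in best.items()}

# Ordering by the date itself picks the earliest row in each group.
earliest = first_by(rows, key=lambda d: d.toordinal())

# Ordering by the negated numeric date picks the latest row instead.
latest = first_by(rows, key=lambda d: -d.toordinal())

print(earliest)
print(latest)
```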

Alternatively, this also worked to accomplish the same thing:

dfg = df.groupby('category').agg({'date': 'max'})
df = df.join(dfg, on='category', rsuffix='_max')
df = df[df['date'] == df['date_max']].drop(['date_max', 'category_max'])

Obviously, it’s not as concise, but I found it to still be much faster than using Pandas. Thanks
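
For anyone comparing results, the same max-then-join-then-filter logic can be sketched in pandas as a reference. This is pandas rather than Vaex, the frame is a made-up miniature of the dataset, and merge stands in for the Vaex join above.

```python
import pandas as pd

# Hypothetical miniature of the dataset (values rounded, for illustration only).
df = pd.DataFrame({
    "category": ["c", "b", "c", "b", "a"],
    "date": pd.to_datetime(
        ["2020-01-20", "2020-01-19", "2020-01-18", "2020-01-16", "2020-01-01"]
    ),
    "x": [0.70, 0.01, -1.59, 0.56, 1.15],
    "y": [61, 63, 16, 23, 2],
})

# Step 1: latest date per category.
date_max = (
    df.groupby("category", as_index=False)["date"]
    .max()
    .rename(columns={"date": "date_max"})
)

# Step 2: attach each category's latest date to every row of that category.
joined = df.merge(date_max, on="category")

# Step 3: keep only the rows whose date equals the group maximum.
latest = joined[joined["date"] == joined["date_max"]].drop(columns="date_max")
print(latest)
```

One difference from the idxmax approach: if several rows in a group share the maximum date, this filter keeps all of them, whereas idxmax returns a single row per group.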