question-mark
Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Multi GroupBy Aggregates and Enhancements

See original GitHub issue

GroupBy Enhancements

Hi Pandas developers and enthusiasts! This is my first time posting, so if I did anything wrong, sorry!

I really really love Pandas, and I was always astonished at its super fast speed. Every dataset, including many millions of rows took only milliseconds-seconds to process!

But, when I wanted to use the GroupBy function, I noticed a problem. Since the Agg function uses a dictionary, that means referencing 1 column to do 2 things is not allowed. Also, I was shocked there was no mode function, nor range, IQR etc etc.

So, maybe I’m annoying you guys, I really wanted to make Pandas even easier to use, and more feature complete!

So I made a similar groupby function, but this time, you can write your own functions, and other methods are provided 😃

Thanks for reading!

groupby(data, by = "vendor_id", how = 'mean')                     #Get all the means
groupby(data, by = ["vendor_id","passenger_count"], how = 'mode')   #Get all the modes
groupby(data, by = ["vendor_id","passenger_count"], 
                                how = 'pickup_longitude = iqr')   #Get IQR of one col

groupby(data, by = ["vendor_id","passenger_count"], 
                                how = 'pickup_longitude = mean, pickup_longitude = percent_total')  
#Same column get mean, percentage of total

groupby(data, by = ["vendor_id","passenger_count"], how = 'pickup_longitude = mean, pickup_longitude = percent_total')

groupby(data, by = ["vendor_id","store_and_fwd_flag"], how = 'passenger_count = f(x.kurt()), passenger_count = f(x.std())')
#Provide own functions with f(x) notation

Functions: Choose from: MODE/LEAST: 1. Mode 2. Mode_group 3. Least 4. Least_group FREQUENCY: 1. Freq_total 2. Freq_group 3. Mode Count 4. Least Count STATISTICS: 1. Range 2. Variance 3. Std 4. IQR FUNCTIONS: Use f(x.var()) notation

def groupby(data, by, how, cols = "all", decimals = 3):
    if cols == 'all': cols = columns(data)
    new = 0;
    if type(by) is str: by = [by]
    if "=" not in how:
        try:
            return eval('data.groupby(by = by)[diff(cols,by)].'+how+'()')
        except: new = 1;
    if new == 1: 
        how = [x+" = "+how for x in diff(cols,by)]
        how = ', '.join(how)

    if "=" in how:
        to_do = []
        for x in how.split(','):
            a = tuple(y.lstrip().rstrip() for y in x.split('='))
            to_do.append(a)

        grouped = data.groupby(by = by)

        if len(by) <= 2: second = by[0];
        elif len(by) > 2: second = by[0:-1];

        result = None; p = []
        
        for key in to_do:
            try:
                new_name = key[0]+">"+key[1]
                what = key[1].lower()
                
                if what == 'mode':
                    right = grouped[key[0]].apply(lambda x: x.value_counts().index[0]).reset_index()[by+[key[0]]]
                
                elif what  == 'mode_count':
                    right = grouped[key[0]].apply(lambda x: x.value_counts().iloc[0]).reset_index()[by+[key[0]]]

                elif what  == 'least_count':
                    right = grouped[key[0]].apply(lambda x: x.value_counts().iloc[-1]).reset_index()[by+[key[0]]]

                elif what == 'mode_group':
                    right = grouped[key[0]].apply(lambda x: (x.value_counts().iloc[0]/x.value_counts().sum()*100).round(decimals)).reset_index()[by+[key[0]]]
                
                elif what == 'least_group':
                    right = grouped[key[0]].apply(lambda x: (x.value_counts().iloc[-1]/x.value_counts().sum()*100).round(decimals)).reset_index()[by+[key[0]]]
                
                elif what == 'least':
                    right = grouped[key[0]].apply(lambda x: x.value_counts().index[-1]).reset_index()[by+[key[0]]]

                elif what == 'std':
                    right = grouped[key[0]].apply(lambda x: x.std().round(decimals)).reset_index()[by+[key[0]]]
                    
                elif what == 'var':
                    right = grouped[key[0]].apply(lambda x: x.var().round(decimals)).reset_index()[by+[key[0]]]
                    
                elif what == 'iqr':
                    right = grouped[key[0]].apply(lambda x: x.quantile(0.75)-x.quantile(0.25)).reset_index()[by+[key[0]]]

                elif what == 'range':
                    right = grouped[key[0]].apply(lambda x: x.max() - x.min()).reset_index()[by+[key[0]]]
                
                elif what in ["freq_total", "percent_total"]:
                    if what == "freq_total":
                        right = pd.DataFrame(grouped[key[0]].count()).reset_index()

                    elif what == 'percent_total':
                        right = pd.DataFrame(grouped[key[0]].sum()).reset_index()

                    summed = right[key[0]].sum()
                    right[key[0]] = round((right[key[0]] / summed*100),decimals)

                elif 'group' in what:
                    if what == "freq_group":
                        right = pd.DataFrame(grouped[key[0]].count()).reset_index()
                        summed = data.groupby(by = second).agg({key[0]:'count'}).reset_index()

                    elif what == 'percent_group':
                        right = pd.DataFrame(grouped[key[0]].sum()).reset_index()
                        summed = data.groupby(by = second).agg({key[0]:'sum'}).reset_index()

                    tot_name = key[0]+">total";
                    summed.columns = change(columns(summed), key[0], tot_name)
                    df = pd.merge(left = right, right = summed, on = second, how = 'left')
                    right[key[0]] = round((df[key[0]]/df[tot_name]*100),decimals)
                    
                elif 'f(' in what:
                    command = what.replace('f(','')[0:-1]
                    right = eval('grouped[key[0]].apply(lambda x: '+command+').reset_index()[by+[key[0]]]')
                
                else: 
                    right = grouped.agg({key[0]:key[1]}).reset_index()
                        
                right.columns = change(columns(right), key[0], new_name)
                
                if result is None: result = exc(grouped.agg({key[0]:'count'}).reset_index(), key[0])
                result[right.columns[-1]] = right[right.columns[-1]]

            except: pass;
        try:
            return result.set_index(by, drop = True)
        except:
            print("Wrong method\n\nChoose from:\nMODE/LEAST: 1. Mode  2. Mode_group  3. Least  4. Least_group")
            print("FREQUENCY: 1. Freq_total  2. Freq_group  3. Mode Count  4. Least Count")
            print("STATISTICS: 1. Range  2. Variance  3. Std  4. IQR\n")

def columns(x):
    return x.columns.tolist()

def diff(want, rem):
    w = copy(want)
    for j in w:
        if j in rem: w.remove(j)
    for j in rem:
        if j in w: w.remove(j)
    return w
    
def exc(data, x):
    return data[diff(columns(data), x)]

def change(l, a, y):
    return [x if x != a else y for x in l]

Issue Analytics

  • State:closed
  • Created 6 years ago
  • Comments:10 (5 by maintainers)

github_iconTop GitHub Comments

1reaction
jorisvandenbosschecommented, Sep 19, 2017

Multi columns is when u wanted to find both the mean and median of one column. The aggregate method can’t do that and can only aggregate the mean and misses the median.

This should be possible with a syntax like df.groupby(..).agg({'col': ['mean', 'median']})

0reactions
danielhanchencommented, Sep 19, 2017

Yep sorry!

Read more comments on GitHub >

github_iconTop Results From Across the Web

Multiple aggregations of the same column using pandas ...
To support column-specific aggregation with control over the output column names, pandas accepts the special syntax in GroupBy.agg() , known ...
Read more >
pandas.core.groupby.DataFrameGroupBy.aggregate
pandas.core.groupby.DataFrameGroupBy.aggregate# ... Aggregate using one or more operations over the specified axis. ... Function to use for aggregating the data. If ...
Read more >
Create a Glide Aggregate query with multiple GroupBy and ...
HI, I want to implement create a Glide Record to implement the following query /* SELECT cmdb_ci, type, metric_name,
Read more >
GroupBy-fu: improvements in grouping and aggregating ...
Perform an aggregation, like computing the sum of mean of each group. ... The ability to group by multiple criteria (just like SQL)...
Read more >
Group and Aggregate by One or More Columns in Pandas
Here's a quick example of how to group on one or multiple columns and summarise data with aggregation functions using Pandas.
Read more >

github_iconTop Related Medium Post

No results found

github_iconTop Related StackOverflow Question

No results found

github_iconTroubleshoot Live Code

Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start Free

github_iconTop Related Reddit Thread

No results found

github_iconTop Related Hackernoon Post

No results found

github_iconTop Related Tweet

No results found

github_iconTop Related Dev.to Post

No results found

github_iconTop Related Hashnode Post

No results found