groupby - apply applies to the first group twice

This bug is easy to reproduce:

import pandas

def printA(group):
    # Side effect: print the key of the group being processed
    print(group.A.values[0])
    return 0

pandas.DataFrame({'A': [0,0,1,1], 'B': [0,1,0,1]}).groupby('A').apply(printA)

This should print

0
1

and return a Series containing two zeroes. Although it does return the expected Series, it prints

0
0
1

So, the function printA is being applied to the first group twice. This doesn’t seem to affect the result that apply eventually returns, but it duplicates any side effects, and if printA performed some expensive computation, it would increase the time to completion.
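One way to check how often the function is actually called per group on a given installation is to count calls instead of printing. The sketch below is only an illustration: `record_call` and `calls` are hypothetical names, the groups are keyed by their row labels, and the counts it reports may differ between pandas versions.

```python
from collections import Counter

import pandas as pd

calls = Counter()

def record_call(group):
    # Key on the group's row labels so the count does not depend on
    # whether the grouping column is passed to the function.
    calls[tuple(group.index)] += 1
    return 0

df = pd.DataFrame({'A': [0, 0, 1, 1], 'B': [0, 1, 0, 1]})

# Selecting column B explicitly keeps the grouping column out of each group.
result = df.groupby('A')[['B']].apply(record_call)

for rows, n in sorted(calls.items()):
    print(rows, n)
```

A count above 1 for the first group's rows would indicate the duplicated call described above.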

Issue Analytics

  • State: closed
  • Created: 9 years ago
  • Reactions: 7
  • Comments: 17 (9 by maintainers)

Top GitHub Comments

1 reaction
jsw-fnal commented, Jul 13, 2014

I see that there is an old enhancement request open to add the slow_apply method I’m requesting (or something similar). So let me lend my voice to those calling for such a thing.

Alternatively, in some cases it might be possible to capture the results of fast_apply from the first group, and not call the user function again in those cases.

Maybe at some point I’ll even take a crack at implementing these things myself.

FWIW, I don’t buy the argument that these things are an “implementation detail” – The way in which my code is called is very important to me, and that brings it quite a bit beyond the level of “implementation detail”.
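The idea of reusing the first group's result rather than calling the user function again can be approximated in user code with a small cache. This is a rough sketch, not something pandas provides: `call_once_per_group` and `effectful` are hypothetical names, and it assumes the row labels are hashable and unique so they can identify a group.

```python
import pandas as pd

def call_once_per_group(func):
    """Wrap func so repeated calls on the same rows reuse the first result."""
    cache = {}

    def wrapper(group):
        key = tuple(group.index)  # assumes hashable, unique row labels
        if key not in cache:
            cache[key] = func(group)
        return cache[key]

    return wrapper

side_effects = []

def effectful(group):
    # Stand-in for a real side effect such as a disk write
    side_effects.append(tuple(group.index))
    return group['B'].sum()

df = pd.DataFrame({'A': [0, 0, 1, 1], 'B': [0, 1, 0, 1]})
result = df.groupby('A')[['B']].apply(call_once_per_group(effectful))

print(side_effects)  # each group's rows are recorded once
```

Because the wrapper returns the cached object itself, this is safest when the function returns scalars or objects that are not modified afterwards.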

1 reaction
jsw-fnal commented, Jul 13, 2014

I think we’re talking at cross purposes. I’m asking a general question (and making a general suggestion and/or feature request), not asking for help optimizing my code. You seemed to fundamentally misunderstand what I was talking about (the iteritems() example was not remotely applicable) and you requested my specific code. I thought you wanted the code to help clarify the general question, but you dove in and tried to tell me how to optimize it. Not what I was looking for.

Here is some code which illustrates the general problem without providing specifics to get distracted by:

def func(group):
    # this must be applied group-wise for reasons beyond my control
    group['newcol'] = expensive_and_effectful_function(group['col1'], group['col2'], group['col3'])
    return group

#newdf = df.groupby(foo).apply(func) # don't do this because it is expensive and the side-effects clobber something

newstuff = [expensive_and_effectful_function(group['col1'], group['col2'], group['col3']) for name, group in df.groupby(foo)]
# But now what do I do with newstuff to get back a dataframe with the new computed column included?

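def make_alternate_func():
    first_run = True
    def alternate_func(group):
        # Python 3: without nonlocal, the assignment below would make
        # first_run a local name and the test would fail on every call
        nonlocal first_run
        if first_run:
            first_run = False
            raise Exception
        return func(group)
    return alternate_func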

newdf = df.groupby(foo).apply(make_alternate_func()) # This one works just fine, but I gather it uses the "slow path" because it raises an exception

By “effectful”, I mean something like disk writes, not modifications to the preexisting columns of group. All I want to do is to compute a new column, not modify the old columns. Does that count as “mutating the passed in data”? Is expensive_and_effectful_function doomed to follow the “slow path” no matter what? If that’s the case, then I’ll just raise an exception on the first run and avoid the duplication that way, even though it’s an ugly kludge.

To put it another way, the duplication of the first run is to allow apply to figure something out. If I already know what it is trying to figure out, can I tell it and save it the trouble?
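For the question of turning the per-group results back into a DataFrame, one possible approach is to skip apply, iterate over the groups explicitly (each group is yielded once), and reassemble the pieces with `pd.concat`. This is a minimal sketch under stated assumptions: the data, the grouping column name `'foo'`, and the body of `expensive_and_effectful_function` are all hypothetical stand-ins.

```python
import pandas as pd

def expensive_and_effectful_function(col1, col2, col3):
    # Hypothetical stand-in for the real computation and its side effects
    return col1 + col2 * col3

df = pd.DataFrame({
    'foo': ['a', 'b', 'a', 'b'],
    'col1': [1, 2, 3, 4],
    'col2': [10, 20, 30, 40],
    'col3': [2, 2, 2, 2],
})

pieces = []
for name, group in df.groupby('foo'):
    group = group.copy()  # work on a copy rather than a slice of df
    group['newcol'] = expensive_and_effectful_function(
        group['col1'], group['col2'], group['col3'])
    pieces.append(group)

# Stitch the groups back together and restore the original row order
newdf = pd.concat(pieces).sort_index()
print(newdf)
```

If the function returns a Series that keeps each group's index, concatenating those Series and assigning the result to a new column of `df` should also work, since the assignment aligns on the index.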

And, lest I come off as too annoyed, demanding, or ungrateful, let me take the opportunity to thank all the pandas developers/contributors. It is really a wonderful tool, and I use it almost daily.
