groupby - apply applies to the first group twice
This bug is easy to reproduce:
import pandas

def printA(group):
    print(group.A.values[0])
    return 0

pandas.DataFrame({'A': [0, 0, 1, 1], 'B': [0, 1, 0, 1]}).groupby('A').apply(printA)
This should print
0
1
and return a Series containing two zeroes. Although it does return the expected Series, it prints
0
0
1
So, the function printA is being applied to the first group twice. This doesn't seem to affect the eventual returned result of apply, but it duplicates any side effects, and if printA performed some expensive computation, it would increase the time to completion.
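One way to sidestep the double invocation (this workaround is not from the issue itself; the snake_case name and the call log are my own illustration) is to iterate over the groups explicitly instead of going through apply, which guarantees the function runs exactly once per group:

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 0, 1, 1], 'B': [0, 1, 0, 1]})

calls = []  # record of how many times the function actually runs

def print_a(group):
    calls.append(group.A.values[0])
    return 0

# Build the same Series that groupby('A').apply(print_a) would return,
# but with exactly one call per group.
result = pd.Series({key: print_a(group) for key, group in df.groupby('A')})
```

After this runs, `calls` holds one entry per group, with no duplicate for the first group.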
Issue Analytics
- Created 9 years ago
- Reactions: 7
- Comments: 17 (9 by maintainers)
I see that there is an old enhancement request open to add the slow_apply method I'm requesting (or something similar). So let me lend my voice to those calling for such a thing. Alternatively, in some cases it might be possible to capture the results of fast_apply from the first group, and not call the user function again in those cases. Maybe at some point I'll even take a crack at implementing these things myself.
FWIW, I don’t buy the argument that these things are an “implementation detail” – The way in which my code is called is very important to me, and that brings it quite a bit beyond the level of “implementation detail”.
I think we're talking at cross purposes. I'm asking a general question (and making a general suggestion and/or feature request), not asking for help optimizing my code. You seemed to fundamentally misunderstand what I was talking about (the iteritems() example was not remotely applicable) and you requested my specific code. I thought you wanted the code to help clarify the general question, but you dove in and tried to tell me how to optimize it. Not what I was looking for. Here is some code which illustrates the general problem without providing specifics to get distracted by:
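The original snippet is not reproduced on this page. A minimal stand-in for the pattern being described (the function name matches the one discussed below; the logged "side effects" and the derived column are my own invention) might look like:

```python
import pandas as pd

side_effects = []  # stand-in for disk writes or other external effects

def expensive_and_effectful_function(group):
    side_effects.append(group.name)  # this effect is duplicated for the
                                     # first group on affected versions
    out = group.copy()
    out['C'] = out['B'] * 2          # compute a NEW column only; the
    return out                       # preexisting columns are untouched

df = pd.DataFrame({'A': [0, 0, 1, 1], 'B': [0, 1, 0, 1]})
result = df.groupby('A').apply(expensive_and_effectful_function)
```

On versions exhibiting the bug, `side_effects` ends up with an extra entry for the first group even though `result` itself is correct.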
By "effectful", I mean something like disk writes, not modifications to the preexisting columns of group. All I want to do is to compute a new column, not modify the old columns. Does that count as "mutating the passed in data"? Is expensive_and_effectful_function doomed to follow the "slow path" no matter what? If that's the case, then I'll just raise an exception on the first run and avoid the duplication that way, even though it's an ugly kludge.

To put it another way, the duplication of the first run is to allow apply to figure something out. If I already know what it is trying to figure out, can I tell it and save it the trouble?

And, lest I come off as too annoyed, demanding, or ungrateful, let me take the opportunity to thank all the pandas developers/contributors. It is really a wonderful tool, and I use it almost daily.