Why does Series.transform() exist?
This is my first issue on GitHub, so apologies in advance if there’s something wrong with the format.
My issue does not have any expected output; I just really want to understand if and why the Series.transform() method is not redundant. Overall, the transform() methods are very similar to the apply() methods, and as I was trying to figure out what the difference between them is (this Stack Overflow topic was helpful), I managed to pinpoint 3 primary differences:
- When the DataFrame is grouped on several categories, apply() sends the entire sub-DataFrames within the function, while transform() sends each column of each sub-DataFrame separately. That’s why columns can’t access values in other columns within transform();
- When the input passed to the function is an iterable of a certain length, apply() can still have the output of any length, while transform() has a limitation of having to output an iterable of the same length as the input;
- When the function outputs a scalar, apply() returns that scalar, while transform() propagates that scalar to the iterable of the input length.
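The first and third differences can be sketched on the grouped case, using the same small frame that appears in the code further down. This is only an illustration under assumptions: the [['a', 'b']] column selection and the lambdas are picked for the example, and details may differ between pandas versions.

```python
import pandas as pd

df = pd.DataFrame({'State': ['Texas', 'Texas', 'Florida', 'Florida'],
                   'a': [4, 5, 1, 3], 'b': [6, 10, 3, 11]})
grouped = df.groupby('State')

# apply() hands each group's sub-DataFrame to the function,
# so the function can combine values from several columns.
cross_column = grouped[['a', 'b']].apply(lambda g: (g['a'] * g['b']).sum())

# With a scalar-returning function, apply() keeps one value per group...
per_group = grouped['a'].apply(lambda s: s.sum())

# ...while transform() broadcasts that value back to every row of the group.
per_row = grouped['a'].transform(lambda s: s.sum())

print(cross_column)
print(per_group)
print(per_row)
```

Here per_group has one entry per state, whereas per_row has the same length and index as df.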
I conducted a series of experiments that test these three differences on each applicable pandas object type: Series, DataFrame, SeriesGroupBy, and DataFrameGroupBy. I can send my ipynb with the code and the results if necessary, but it would be sufficient to just look at the conclusion for the Series type:
1 – not applicable. In both cases the function has a scalar input.
2 – not applicable. No matter what the function returns, in both cases the result is assigned to the single cell, even if it means entire DataFrames within cells of a Series.
3 – not applicable. The input length is always “1” (it’s considered “1” even when it’s an iterable), so there’s no need to propagate.
Inapplicability of 1 is self-explanatory. But 2 was a surprise. Below is the code I tried:
import pandas as pd

df = pd.DataFrame({'State': ['Texas', 'Texas', 'Florida', 'Florida'],
                   'a': [4, 5, 1, 3], 'b': [6, 10, 3, 11]})

# Returns a 2x2 DataFrame regardless of the input
def return_df(x):
    return pd.DataFrame([[4, 5], [3, 2]])

# Returns a 2-element Series regardless of the input
def return_series(x):
    return pd.Series([1, 2])

df['a'].transform(return_df)
df['a'].transform(return_series)
If you try this code, you’ll see that it doesn’t matter what the function returns. Whatever it is, it will be put inside the single Series cell in its entirety. Is this behavior intentional? It results in the output size being predetermined by the input size, so all the size checks that Series.transform() has within itself become redundant. I can’t imagine any situation where Series.transform() could behave in a different way from Series.apply(). And that raises the question I posed: why does Series.transform() exist?
Issue Analytics
- Created 4 years ago
- Comments:8 (5 by maintainers)
Top GitHub Comments
@MarcoGorelli The question was about Series.transform, which does not allow aggregator broadcasting, unlike SeriesGroupBy.transform (for which aggregator broadcasting is the main use case).
Admittedly it’s unclear to me why Series.transform does not support aggregator broadcasting, since I thought the point of adding agg, transform, and apply was to mimic the groupby versions.
Your observation 1 is wrong. Series.transform can also take a function that takes a Series. Your problem is that in your examples the return values have only two rows while your df had 4.
And you can also do multiple transformers:
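A minimal sketch of what that can look like (not the commenter's original snippet); the Series values and the choice of np.sqrt and np.exp are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 4.0, 9.0])

# Passing a list of functions yields a DataFrame with one column per function,
# each column having the same length and index as the input Series.
out = s.transform([np.sqrt, np.exp])
print(out)
```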
Neither of those is available with apply.