Why does Series.transform() exist?

This is my first issue on GitHub, so apologies in advance if there’s something wrong with the format.

My issue does not have an expected output; I just want to understand whether, and why, the Series.transform() method is not redundant. Overall, the transform() methods are very similar to the apply() methods. While trying to figure out the difference between them (this Stack Overflow topic was helpful), I pinpointed three primary differences:

1. When the DataFrame is grouped on several categories, apply() passes each entire sub-DataFrame to the function, while transform() passes each column of each sub-DataFrame separately. That’s why the function can’t access values in other columns within transform();
2. When the input passed to the function is an iterable of a certain length, apply() can still return output of any length, while transform() must return an iterable of the same length as the input;
3. When the function returns a scalar, apply() returns that scalar, while transform() broadcasts it to the length of the input.
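
As a rough illustration of the three differences on grouped objects, here is a minimal sketch that reuses the DataFrame from the code further down. The variable names and the explicit `[["a", "b"]]` column selection are assumptions made for this example, and exact behavior may vary between pandas versions.

```python
import pandas as pd

df = pd.DataFrame({
    "State": ["Texas", "Texas", "Florida", "Florida"],
    "a": [4, 5, 1, 3],
    "b": [6, 10, 3, 11],
})

# Difference 1: apply() receives each sub-DataFrame, so the function can
# combine columns "a" and "b" of the same group.
gap_per_state = df.groupby("State")[["a", "b"]].apply(
    lambda g: (g["b"] - g["a"]).sum()
)

# Difference 2: apply() may return fewer rows than the group it was given.
first_a = df.groupby("State")["a"].apply(lambda s: s.head(1))

# Difference 3: transform() broadcasts a scalar result to every row of the group.
state_total_a = df.groupby("State")["a"].transform(lambda s: s.sum())

print(gap_per_state)
print(first_a)
print(state_total_a)
```

Here `gap_per_state` has one value per state, `first_a` has one row per state, and `state_total_a` keeps the four-row shape of the original column.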

I conducted a series of experiments that test these three differences on each applicable pandas object type: Series, DataFrame, SeriesGroupBy, and DataFrameGroupBy. I can send my ipynb with the code and the results if necessary, but it would be sufficient to just look at the conclusion for the Series type:

1 – not applicable. In both cases the function receives a scalar input.
2 – not applicable. No matter what the function returns, in both cases the result is assigned to a single cell, even if that means entire DataFrames within cells of a Series.
3 – not applicable. The input length is always “1” (it’s considered “1” even when it’s an iterable), so there’s nothing to propagate.

Inapplicability of 1 is self-explanatory. But 2 was a surprise. Below is the code I tried:

import pandas as pd

df = pd.DataFrame({'State':['Texas', 'Texas', 'Florida', 'Florida'], 
                   'a':[4,5,1,3], 'b':[6,10,3,11]})

def return_df(x):
    return pd.DataFrame([[4, 5], [3, 2]])

def return_series(x):
    return pd.Series([1, 2])

df['a'].transform(return_df)
df['a'].transform(return_series)

If you try this code, you’ll see that it doesn’t matter what the function returns. Whatever it is, it will be put inside the single Series cell in its entirety. Is this behavior intentional? It results in the output size being predetermined by the input size, so all the size checks that Series.transform() has within itself become redundant. I can’t imagine any situation where Series.transform() could behave in a different way from Series.apply(). And that raises the question I posed: why does Series.transform() exist?

Top GitHub Comments

Liam3851 commented, Feb 12, 2020

@MarcoGorelli The question was about Series.transform, which does not allow aggregator broadcasting, unlike SeriesGroupBy.transform (for which aggregator broadcasting is the main use case).

In [3]: s1.transform('sum')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-6db3fc2c8d83> in <module>
----> 1 s1.transform('sum')

C:\Miniconda3\envs\bleeding\lib\site-packages\pandas\core\series.py in transform(self, func, axis, *args, **kwargs)
   3715         # Validate the axis parameter
   3716         self._get_axis_number(axis)
-> 3717         return super().transform(func, *args, **kwargs)
   3718
   3719     def apply(self, func, convert_dtype=True, args=(), **kwds):

C:\Miniconda3\envs\bleeding\lib\site-packages\pandas\core\generic.py in transform(self, func, *args, **kwargs)
  10427         result = self.agg(func, *args, **kwargs)
  10428         if is_scalar(result) or len(result) != len(self):
> 10429             raise ValueError("transforms cannot produce aggregated results")
  10430
  10431         return result

ValueError: transforms cannot produce aggregated results

Admittedly it’s unclear to me why Series.transform does not support aggregator broadcasting, since I thought the point of adding agg, transform, and apply was to mimic the groupby versions.
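
The contrast described in this comment can be reproduced with a small self-contained sketch. The Series `s` and `state` below are assumptions made for illustration (the original `s1` is not shown), and the exact error message differs between pandas versions, so only the exception type is checked.

```python
import pandas as pd

s = pd.Series([4, 5, 1, 3], name="a")
state = pd.Series(["Texas", "Texas", "Florida", "Florida"], name="State")

# Series.transform rejects a reducing function because the result is a scalar.
try:
    s.transform("sum")
    raised = False
except ValueError as exc:
    raised = True
    print(f"Series.transform('sum') raised: {exc}")

# SeriesGroupBy.transform broadcasts the per-group sum back to each row.
broadcast = s.groupby(state).transform("sum")
print(broadcast.tolist())
```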

Liam3851 commented, Feb 12, 2020

Your observation 1 is wrong. Series.transform can also take a function that takes a Series. Your problem is that in your examples the return values have only two rows while your df had 4.

In [20]: df.a.transform(lambda x: (x - x.mean()) / x.std())
Out[20]:
0    0.439155
1    1.024695
2   -1.317465
3   -0.146385
Name: a, dtype: float64

And you can also do multiple transformers:

In [27]: df.a.transform([np.sqrt, np.exp])
Out[27]:
       sqrt         exp
0  2.000000   54.598150
1  2.236068  148.413159
2  1.000000    2.718282
3  1.732051   20.085537

Neither of those is available with apply.
