Standardize columns of a dataframe

Is there a function which standardizes data in the columns of a dataframe? If not, can we introduce it?

Standardization re-scales the data to have a mean of zero and a standard deviation of one. I use it regularly for two purposes:

To facilitate interpretation of regression estimates. Here is a related discussion on Cross Validated.
To facilitate inspection of time-series plots. Especially when I want to understand whether different series co-move over time, plotting them on the same scale helps.

So far, I use a simple function:

def standardize(df, label):
    """
    Standardize the series named ``label`` within the pd.DataFrame ``df``.
    """
    series = df.loc[:, label]
    avg = series.mean()
    stdv = series.std()  # sample standard deviation (ddof=1)
    # The arithmetic below builds a new Series, so ``df`` is left unmodified.
    series_standardized = (series - avg) / stdv
    return series_standardized

I was wondering whether there could be a standardize method that is used similarly to rolling, such as df.standardize().
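pandas has no built-in df.standardize(), but its accessor registration API allows something close to that call style. The following is a minimal sketch, not an official pandas feature: the accessor name "scaling", the class name, and the columns parameter are all hypothetical choices made for this illustration.

```python
import numpy as np
import pandas as pd


# "scaling" is a hypothetical accessor name chosen for this sketch.
@pd.api.extensions.register_dataframe_accessor("scaling")
class ScalingAccessor:
    def __init__(self, pandas_obj):
        self._obj = pandas_obj

    def standardize(self, columns=None):
        """Return a copy whose selected columns have mean 0 and std 1."""
        df = self._obj.copy()
        if columns is None:
            # Default to every numeric column; other columns pass through.
            columns = df.select_dtypes(include="number").columns
        for col in columns:
            # Series.std() uses the sample standard deviation (ddof=1).
            df[col] = (df[col] - df[col].mean()) / df[col].std()
        return df


df = pd.DataFrame(
    {
        "a": [1.0, 2.0, 3.0, 4.0],
        "b": [10.0, 20.0, 40.0, 80.0],
        "name": list("wxyz"),
    }
)
out = df.scaling.standardize()
print(out)
```

The call then reads df.scaling.standardize(), which returns a new DataFrame and leaves the original untouched. A constant column would have a standard deviation of zero and produce NaN values, so a real implementation would probably need to decide how to handle that case.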

Issue Analytics

Created 6 years ago
Comments: 7 (6 by maintainers)

Top GitHub Comments

24 reactions

mwaskom commented, Oct 31, 2017

For the record, the fact that pandas doesn’t handle using scipy.stats.zscore properly here, and so the user needs to write a lambda (for an extremely common operation), remains incredibly annoying.
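One detail worth keeping in mind when comparing a hand-written lambda with SciPy's z-score routine (scipy.stats.zscore): the two use different defaults for the standard deviation. Series.std() defaults to the sample estimate (ddof=1), while scipy.stats.zscore defaults to the population estimate (ddof=0). The sketch below shows the difference using pandas only; the sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data with mean 5 and population standard deviation 2.
s = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# pandas default: sample standard deviation (ddof=1).
z_sample = (s - s.mean()) / s.std()

# Population standard deviation (ddof=0), the default convention
# of scipy.stats.zscore.
z_population = (s - s.mean()) / s.std(ddof=0)

print(s.std(ddof=0))         # 2.0 for this series
print(z_population.iloc[0])  # -1.5, i.e. (2 - 5) / 2
```

The two results differ only by a constant factor of sqrt((n - 1) / n), so the choice matters most for short series.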

0 reactions

jreback commented, Oct 31, 2017

as I wrote before here

In [13]: standardize = lambda x: (x-x.mean()) / x.std()

In [14]: s = pd.Series(np.random.rand(10))
    ...: 
    ...: (s-s.mean())/s.std()
    ...: 
Out[14]: 
0    0.395159
1    0.611805
2   -1.976001
3    0.512755
4    0.954300
5   -0.873228
6   -0.988174
7   -0.099802
8    0.196835
9    1.266350
dtype: float64

In [15]: standardize = lambda x: (x-x.mean()) / x.std()

In [16]: s.pipe(standardize)
Out[16]: 
0    0.395159
1    0.611805
2   -1.976001
3    0.512755
4    0.954300
5   -0.873228
6   -0.988174
7   -0.099802
8    0.196835
9    1.266350
dtype: float64
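The same function also works on a whole DataFrame, which is what the original question asked about. Because df.mean() and df.std() return one value per column, the subtraction and division align on column labels and each column is standardized independently. This is a sketch under assumptions: the seed, the shape, and the column names are arbitrary.

```python
import numpy as np
import pandas as pd


def standardize(x):
    # Same expression as the lambda above; works for Series and DataFrame.
    return (x - x.mean()) / x.std()


rng = np.random.default_rng(0)  # seeded so the run is repeatable
df = pd.DataFrame(rng.random((10, 3)), columns=["a", "b", "c"])

# df.mean() and df.std() are per-column Series, so the arithmetic
# is applied column by column.
z = df.pipe(standardize)

print(z.mean())  # each entry is zero up to floating-point error
print(z.std())   # each entry is one up to floating-point error
```

This relies on every column being numeric; a DataFrame with mixed types would need the numeric columns selected first, for example with select_dtypes.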