Implement high performance rolling_rank

xref SO issue here

I'm looking to compute a rolling rank on a DataFrame. Having posted about, discussed and analysed the code, it looks like the suggested way is to pass the pandas Series.rank function to rolling_apply. However, on large datasets the performance is particularly poor. I have tried different implementations, and using bottleneck's rankdata method is roughly an order of magnitude faster, but it only offers the average option for ties. It is also still some way off the performance of rolling_mean. I have previously implemented a rolling rank function which monitors changes on a moving window (in a similar way to algos.roll_mean, I believe) rather than recalculating the rank from scratch on each window. Below is an example to highlight the performance; it should be possible to implement a rolling rank with performance comparable to rolling_mean.

python: 2.7.3 pandas: 0.15.2 scipy: 0.10.1 bottleneck: 0.7.0

```python
import bottleneck as bd
import numpy as np
import pandas as pd
import scipy.stats as sc

# Each helper returns the descending rank of the last element in the window.
def rollingRankOnSeries(array):
    s = pd.Series(array)
    return s.rank(method='min', ascending=False)[len(s) - 1]

def rollingRankSciPy(array):
    return array.size + 1 - sc.rankdata(array)[-1]

def rollingRankBottleneck(array):
    return array.size + 1 - bd.rankdata(array)[-1]

def rollingRankArgSort(array):
    return array.size - array.argsort().argsort()[-1]

rollWindow = 240
df = pd.DataFrame(np.random.randn(100000, 4), columns=list('ABCD'),
                  index=pd.date_range('1/1/2000', periods=100000, freq='1H'))
# Introduce ties at the end of column A (single indexer, no chained assignment)
df.iloc[-3:-1, df.columns.get_loc('A')] = 7.5
df.iloc[-1, df.columns.get_loc('A')] = 5.5

# pd.rolling_apply / pd.rolling_mean are the pandas 0.15.2 API listed above
df["SER_RK"] = pd.rolling_apply(df["A"], rollWindow, rollingRankOnSeries)
# 28.9 s (allows competition/min ranking for ties)

df["SCIPY_RK"] = pd.rolling_apply(df["A"], rollWindow, rollingRankSciPy)
# 70.89 s (allows competition/min ranking for ties)

df["BNECK_RK"] = pd.rolling_apply(df["A"], rollWindow, rollingRankBottleneck)
# 3.64 s (only provides average ranking for ties)

df["ASRT_RK"] = pd.rolling_apply(df["A"], rollWindow, rollingRankArgSort)
# 3.56 s (only provides competition/min ranking for ties, not necessarily correct result)

df["MEAN"] = pd.rolling_mean(df['A'], window=rollWindow)
# 0.008 s
```

I think this is likely to be a common request from users looking to use pandas for analysis on large datasets, and I thought it would be a useful addition to the pandas moving statistics/moments suite.
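
The incremental idea described above, updating the window's ordering as values enter and leave instead of re-ranking each window from scratch, can be sketched in pure Python. This is only an illustration and not the pandas implementation: the function name `rolling_rank_min_desc` is hypothetical, the input is assumed to contain no NaNs, and the result is the descending min (competition) rank of the newest value in each full window.

```python
from bisect import bisect_left, bisect_right, insort
import math


def rolling_rank_min_desc(values, window):
    """Descending 'min' rank of the newest value in each full window.

    Hypothetical helper for illustration; assumes `values` has no NaNs.
    """
    sorted_win = []  # current window contents, kept in ascending order
    out = []
    for i, x in enumerate(values):
        if i >= window:
            # Drop the value that just left the window.
            old = values[i - window]
            del sorted_win[bisect_left(sorted_win, old)]
        insort(sorted_win, x)  # add the newest value in sorted position
        if len(sorted_win) < window:
            out.append(math.nan)  # window not yet full
        else:
            # Count of values strictly greater than x, plus one.
            out.append(len(sorted_win) - bisect_right(sorted_win, x) + 1)
    return out
```

Each step costs two binary searches plus a list insert and delete, rather than a full rank of the window. The list operations are still linear in the window size, so a faster version would likely need a different container or a compiled implementation.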

Issue Analytics

  • State: closed
  • Created 9 years ago
  • Comments: 20 (7 by maintainers)

Top GitHub Comments

bmpalatiello commented, Jan 23, 2019 (5 reactions)

If there is still interest, my workaround for this is:

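```python
import bottleneck as bk

# x is the DataFrame and n is the window size
norm_rank = bk.move_rank(x.values, n, axis=0)          # normalized rank in [-1, 1]
denorm = (((norm_rank + 1) / 2) * (n - 1)) + 1         # ascending rank in [1, n]
descend = (n - denorm) + 1                             # descending rank
```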

The bk.move_rank function returns a normalized rank between -1 and 1, so this takes the normalized rank and reverse-engineers it to recover the actual rank, then flips it into descending order (the equivalent of ascending=False). The only potential downside is that it only provides average ranking for ties.
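
The arithmetic in that conversion can be checked on its own, without bottleneck. The sketch below uses a hypothetical helper name, `descending_rank_from_normalized`, and assumes only what the comment states: the normalized rank is -1 for the smallest value in the window and 1 for the largest.

```python
def descending_rank_from_normalized(norm_rank, n):
    """Map a normalized rank in [-1, 1] to a descending rank in [1, n].

    Hypothetical helper mirroring the denorm/descend steps above.
    """
    ascending = ((norm_rank + 1) / 2) * (n - 1) + 1  # 1 = smallest, n = largest
    return (n - ascending) + 1                       # 1 = largest, n = smallest


n = 240
print(descending_rank_from_normalized(1.0, n))   # largest value in the window
print(descending_rank_from_normalized(-1.0, n))  # smallest value in the window
print(descending_rank_from_normalized(0.0, n))   # midpoint of the window
```

With a window of 240, the largest value maps to rank 1, the smallest to rank 240, and the midpoint to 120.5, which is the kind of fractional value that average ranking for ties produces.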

Running it on my small laptop:

window = 240
x = pd.DataFrame(np.random.randn(100000,4), columns=list('ABCD'), index=pd.date_range('1/1/2000', periods=100000, freq='1H'))

# Original rollingRankBottleneck above
6.04 s ± 302 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# This version
411 ms ± 24.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

jreback commented, Sep 5, 2020 (0 reactions)

@contribu thanks for the implementation. Ideally this would be ported almost directly to Cython and embedded in the current infrastructure. We don't have very much C++ code in pandas and mostly use Cython. If you could do this, that would be fantastic.
