BUG: rolling std can't handle mixture of (relatively) big and small numbers

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


import pandas as pd
import numpy as np

df = pd.DataFrame(np.zeros(1000))
df.iloc[0] = 1000
df.rolling(10).std()

Out[124]: 
            0
0         NaN
1         NaN
2         NaN
3         NaN
4         NaN
..        ...
995  0.000004
996  0.000004
997  0.000004
998  0.000004
999  0.000004
[1000 rows x 1 columns]

Problem description

If the first value is relatively large (here 1000) and all remaining values are zero, the rolling standard deviation never returns to exactly zero, even for windows that contain only zeros.

Expected Output

df.iloc[0] = 0
df.rolling(10).std()

Out[125]: 
       0
0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
..   ...
995  0.0
996  0.0
997  0.0
998  0.0
999  0.0
[1000 rows x 1 columns]

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 2cb96529396d93b46abab7bbc73a208e708c642e
python : 3.8.10.final.0
python-bits : 64
OS : Linux
OS-release : 5.8.0-53-generic
Version : #60-Ubuntu SMP Thu May 6 07:46:32 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.2.4
numpy : 1.20.2
pytz : 2021.1
dateutil : 2.8.1
pip : 21.0.1
setuptools : 52.0.0.post20210125
Cython : None
pytest : 6.2.3
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.6.2
html5lib : None
pymysql : 0.9.3
psycopg2 : 2.8.6 (dt dec pq3 ext lo64)
jinja2 : 3.0.0
IPython : 7.23.1
pandas_datareader: 0.9.0
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.4
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 4.0.0
pyxlsb : None
s3fs : None
scipy : 1.6.2
sqlalchemy : 1.4.15
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : 0.53.1

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Reactions: 2
  • Comments: 9 (8 by maintainers)

Top GitHub Comments

1 reaction
mroeschke commented, Jun 17, 2021

Note: this is called out in the user guide in the warning block https://pandas.pydata.org/docs/user_guide/window.html#overview

and even though 1.3 does have a change in rolling.std/var, we note that there will be numerical imprecision as well: https://pandas.pydata.org/pandas-docs/dev/whatsnew/v1.3.0.html#removed-artificial-truncation-in-rolling-variance-and-standard-deviation

There will likely be no definitive fix for this, given the algorithm we’re using to calculate rolling aggregations.
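For readers who need exact zeros in the meantime, one possible workaround is to recompute every window from scratch rather than rely on the online update. The snippet below is only a sketch under that assumption: it reuses the data from the report, passes each window to np.std through rolling().apply, and is expected to be noticeably slower than the built-in rolling().std() on large inputs.

```python
import numpy as np
import pandas as pd

# Same data as in the report: one large value followed by zeros.
df = pd.DataFrame(np.zeros(1000))
df.iloc[0] = 1000

# Recompute each window independently instead of using the online update.
# raw=True hands every window to np.std as a plain ndarray; ddof=1 matches
# the sample standard deviation that rolling().std() computes by default.
exact = df.rolling(10).apply(np.std, raw=True, kwargs={"ddof": 1})

print(exact.tail())  # windows that contain only zeros give exactly 0.0
```

Because each window is evaluated on its own, no residue from the initial 1000 can carry over into later windows.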

0 reactions
auderson commented, Mar 19, 2022

In short, the new algorithm records how many consecutive identical values have appeared:

# in `add_var`
if val == prev_value[0] and val != MAXfloat64 and val != MINfloat64:
    n_same_value[0] += 1  # incr count
else:
    n_same_value[0] = 1  # if not same, reset count to 1 (include itself)

prev_value[0] = val  # store prev value

If values are the same, we use 0 as the result, instead of _ssqdm_x / (_nobs - ddof)

# in `calc_var`
if (nobs >= minp) and (nobs > ddof):

    # pathological case & repeatedly same values case
    if nobs == 1 or num_consecutive_same_value >= nobs:
        result = 0
    else:
        result = ssqdm_x / (nobs - <float64_t>ddof)
else:
    result = NaN

The floating-point artifacts come from _ssqdm_x / (_nobs - ddof), and if we know the values are actually the same, I think it’s safe to use 0 as the result (instead of 2.91038305e-11 from the example).
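The idea above can be sketched in pure Python. This is a simplified illustration, not the pandas Cython code: the function name rolling_var and the zero_if_constant flag are hypothetical, and the MAXfloat64/MINfloat64 guards from the snippet above are left out, as is any compensated summation.

```python
import math


def rolling_var(values, window, ddof=1, zero_if_constant=True):
    """Sliding-window variance with online add/remove updates (illustrative)."""
    nobs = 0         # number of observations currently in the window
    mean = 0.0       # running mean of the window
    ssqdm = 0.0      # running sum of squared differences from the mean
    n_same = 0       # length of the current run of identical values
    prev = math.nan  # previously added value
    out = []

    for i, val in enumerate(values):
        # Remove the value that just left the window.
        if i >= window:
            old = values[i - window]
            nobs -= 1
            if nobs:
                delta = old - mean
                mean -= delta / nobs
                ssqdm -= (old - mean) * delta
            else:
                mean = ssqdm = 0.0

        # Add the incoming value and update the same-value counter.
        nobs += 1
        delta = val - mean
        mean += delta / nobs
        ssqdm += (val - mean) * delta
        n_same = n_same + 1 if val == prev else 1
        prev = val

        if nobs >= window and nobs > ddof:
            if zero_if_constant and n_same >= nobs:
                out.append(0.0)  # every value in the window is identical
            else:
                out.append(ssqdm / (nobs - ddof))
        else:
            out.append(math.nan)
    return out


data = [1000.0] + [0.0] * 999
print(rolling_var(data, 10)[-1])                          # 0.0 via the shortcut
print(rolling_var(data, 10, zero_if_constant=False)[-1])  # may be a tiny residue
```

The counter only has to track the current run of equal values: once that run is at least as long as the window, every value in the window is known to be identical, so the division can be skipped.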
