BUG: rolling std can't handle mixture of (relatively) big and small numbers
See original GitHub issue.

- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- (optional) I have confirmed this bug exists on the master branch of pandas.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.zeros(1000))
df.iloc[0] = 1000
df.rolling(10).std()
Out[124]:
0
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
.. ...
995 0.000004
996 0.000004
997 0.000004
998 0.000004
999 0.000004
[1000 rows x 1 columns]
Problem description
If the series contains one relatively large value at the top and the remaining values are zeros, the rolling standard deviation over windows that contain only zeros is not zero.
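Pandas computes the rolling variance with an online (Welford-style) update that adds and removes values from running statistics rather than recomputing each window. The pure-Python sketch below mimics that idea (it is not the actual Cython kernel) to show where the residue comes from: the intermediate means are not exactly representable, so rounding error is left behind after the large value exits the window.

```python
# Sketch (not pandas internals): running count, mean, and sum of squared
# differences from the mean, updated as values enter and leave the window.
nobs, mean, ssqdm = 0, 0.0, 0.0

def add_value(x):
    global nobs, mean, ssqdm
    nobs += 1
    delta = x - mean
    mean += delta / nobs
    ssqdm += delta * (x - mean)

def remove_value(x):
    global nobs, mean, ssqdm
    nobs -= 1
    delta = x - mean
    mean -= delta / nobs
    ssqdm -= delta * (x - mean)

for v in [1000.0] + [0.0] * 9:   # fill a 10-element window
    add_value(v)
remove_value(1000.0)             # the large value leaves the window ...
add_value(0.0)                   # ... and a zero enters

# The window now holds ten zeros, so ssqdm should be exactly 0.0, but the
# accumulated rounding error generally leaves a tiny nonzero residue.
print(nobs, mean, ssqdm)
```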
Expected Output
df.iloc[0] = 0
df.rolling(10).std()
Out[125]:
0
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
.. ...
995 0.0
996 0.0
997 0.0
998 0.0
999 0.0
[1000 rows x 1 columns]
Output of pd.show_versions()
INSTALLED VERSIONS
commit           : 2cb96529396d93b46abab7bbc73a208e708c642e
python           : 3.8.10.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.8.0-53-generic
Version          : #60-Ubuntu SMP Thu May 6 07:46:32 UTC 2021
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8
pandas           : 1.2.4
numpy            : 1.20.2
pytz             : 2021.1
dateutil         : 2.8.1
pip              : 21.0.1
setuptools       : 52.0.0.post20210125
Cython           : None
pytest           : 6.2.3
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : 4.6.2
html5lib         : None
pymysql          : 0.9.3
psycopg2         : 2.8.6 (dt dec pq3 ext lo64)
jinja2           : 3.0.0
IPython          : 7.23.1
pandas_datareader: 0.9.0
bs4              : None
bottleneck       : None
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : 3.3.4
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 4.0.0
pyxlsb           : None
s3fs             : None
scipy            : 1.6.2
sqlalchemy       : 1.4.15
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
numba            : 0.53.1
Issue Analytics
- State:
- Created: 2 years ago
- Reactions: 2
- Comments: 9 (8 by maintainers)
Note: this is called out in the warning block of the user guide: https://pandas.pydata.org/docs/user_guide/window.html#overview

Even though 1.3 does change rolling.std/var, we note that some numerical imprecision will remain: https://pandas.pydata.org/pandas-docs/dev/whatsnew/v1.3.0.html#removed-artificial-truncation-in-rolling-variance-and-standard-deviation

There will likely be no definitive fix for this, given the algorithm we're using to calculate rolling aggregations.
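Since no definitive fix exists for the online algorithm, one practical workaround (my suggestion, not from this thread) is to recompute each window from scratch with `rolling().apply`, which uses pandas' two-pass `Series.std` per window; it is much slower than the built-in rolling kernel, but all-zero windows come out exactly 0.0:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros(1000))
df.iloc[0] = 1000

# apply() recomputes each window independently; Series.std is two-pass, so
# the catastrophic cancellation of the online update never occurs.
exact = df.rolling(10).apply(lambda w: w.std(ddof=1))
print(exact.iloc[-1, 0])   # 0.0, not ~4e-6
```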
So in short, the new algorithm records how many consecutive identical values have appeared: if all the values in the window are the same, we use 0 as the result instead of _ssqdm_x / (_nobs - ddof). The floating-point artifacts come from _ssqdm_x / (_nobs - ddof), and if we know the values are actually identical, I think it's safe to use 0 as the result (instead of 2.91038305e-11 from the example).
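That idea can be sketched in pure Python as follows. This is a hypothetical helper for illustration, not the actual pandas kernel: it tracks the length of the run of identical trailing values and short-circuits to 0.0 whenever the run covers the whole window (the sketch uses a two-pass variance for the general case, whereas pandas uses the online update).

```python
import math

def rolling_std_stable(values, window, ddof=1):
    # Sketch of the "count identical values" idea from pandas 1.3.
    out = [float("nan")] * len(values)
    same_run = 0  # length of the run of identical values ending at i
    for i, x in enumerate(values):
        same_run = same_run + 1 if i > 0 and x == values[i - 1] else 1
        if i >= window - 1:
            if same_run >= window:
                # Every value in the window is identical: the standard
                # deviation is exactly 0, so skip the error-prone division.
                out[i] = 0.0
            else:
                w = values[i - window + 1 : i + 1]
                m = sum(w) / window
                var = sum((v - m) ** 2 for v in w) / (window - ddof)
                out[i] = math.sqrt(var)
    return out

vals = [1000.0] + [0.0] * 19
result = rolling_std_stable(vals, 10)
print(result[9], result[-1])   # the last window is all zeros -> exactly 0.0
```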