BUG: rolling() function does not work with Float64 columns with missing values
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- (optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
Simple series with floats, dtype float64:
>>> df = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0])
>>> df
0 0.0
1 1.0
2 2.0
3 3.0
4 4.0
dtype: float64
diff() introduces a NaN in the first position:
>>> df.diff()
0 NaN
1 1.0
2 1.0
3 1.0
4 1.0
dtype: float64
This works as expected:
>>> df.diff().rolling(2).sum()
0 NaN
1 NaN
2 2.0
3 2.0
4 2.0
We can cast to Float64 (the nullable extension dtype):
>>> df.astype('Float64')
0 0.0
1 1.0
2 2.0
3 3.0
4 4.0
dtype: Float64
diff() still works fine, but gives us a pd.NA instead of a NaN:
>>> df.astype('Float64').diff()
0 <NA>
1 1.0
2 1.0
3 1.0
4 1.0
dtype: Float64
Now, when we call rolling(), everything comes tumbling down:
>>> df.astype('Float64').diff().rolling(2).sum()
Traceback (most recent call last):
File "/home/janlugt/repositories/streaming_anomaly_detection/.venv/lib/python3.8/site-packages/pandas/core/window/rolling.py", line 321, in _prep_values
values = ensure_float64(values)
File "pandas/_libs/algos_common_helper.pxi", line 45, in pandas._libs.algos.ensure_float64
File "/home/janlugt/repositories/streaming_anomaly_detection/.venv/lib/python3.8/site-packages/pandas/core/arrays/masked.py", line 335, in __array__
return self.to_numpy(dtype=dtype)
File "/home/janlugt/repositories/streaming_anomaly_detection/.venv/lib/python3.8/site-packages/pandas/core/arrays/masked.py", line 292, in to_numpy
raise ValueError(
ValueError: cannot convert to 'float64'-dtype NumPy array with missing values. Specify an appropriate 'na_value' for this dtype.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/janlugt/repositories/streaming_anomaly_detection/.venv/lib/python3.8/site-packages/pandas/core/window/rolling.py", line 402, in _apply_series
values = self._prep_values(obj._values)
File "/home/janlugt/repositories/streaming_anomaly_detection/.venv/lib/python3.8/site-packages/pandas/core/window/rolling.py", line 323, in _prep_values
raise TypeError(f"cannot handle this type -> {values.dtype}") from err
TypeError: cannot handle this type -> Float64
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/janlugt/repositories/streaming_anomaly_detection/.venv/lib/python3.8/site-packages/pandas/core/window/rolling.py", line 1723, in sum
return super().sum(*args, engine=engine, engine_kwargs=engine_kwargs, **kwargs)
File "/home/janlugt/repositories/streaming_anomaly_detection/.venv/lib/python3.8/site-packages/pandas/core/window/rolling.py", line 1233, in sum
return self._apply(window_func, name="sum", **kwargs)
File "/home/janlugt/repositories/streaming_anomaly_detection/.venv/lib/python3.8/site-packages/pandas/core/window/rolling.py", line 539, in _apply
return self._apply_blockwise(homogeneous_func, name)
File "/home/janlugt/repositories/streaming_anomaly_detection/.venv/lib/python3.8/site-packages/pandas/core/window/rolling.py", line 417, in _apply_blockwise
return self._apply_series(homogeneous_func, name)
File "/home/janlugt/repositories/streaming_anomaly_detection/.venv/lib/python3.8/site-packages/pandas/core/window/rolling.py", line 404, in _apply_series
raise DataError("No numeric types to aggregate") from err
pandas.core.base.DataError: No numeric types to aggregate
Problem description
I would expect rolling() to produce the same output whether it is called on a float64 or a Float64 series. The promise of pd.NA is that it can be used consistently across data types and makes type-specific missing values such as np.nan or NaT unnecessary, which is clearly not the case here.
Expected Output
>>> df.astype('Float64').diff().rolling(2).sum()
0 <NA>
1 <NA>
2 2.0
3 2.0
4 2.0
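A possible workaround on affected versions, sketched under the assumption that temporarily turning pd.NA into NaN is acceptable: convert the Float64 series to a plain float64 series with an explicit na_value (as the ValueError in the traceback suggests), run the rolling sum, and cast back. The helper name rolling_sum_nullable is hypothetical and not part of pandas.

```python
import numpy as np
import pandas as pd


def rolling_sum_nullable(series, window):
    """Hypothetical helper: rolling sum for a nullable Float64 series.

    Assumption: representing pd.NA as NaN during the computation is fine.
    """
    # Convert to a NumPy float64 array, mapping pd.NA to NaN explicitly.
    as_float = pd.Series(
        series.to_numpy(dtype="float64", na_value=np.nan),
        index=series.index,
        name=series.name,
    )
    # Compute on the float64 series, then cast the result back to Float64.
    return as_float.rolling(window).sum().astype("Float64")


s = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0])
result = rolling_sum_nullable(s.astype("Float64").diff(), 2)
print(result)
```

The round trip through float64 discards any distinction between pd.NA and NaN, so this only fits data where the two mean the same thing.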
Output of pd.show_versions()
INSTALLED VERSIONS
commit : 5f648bf1706dd75a9ca0d29f26eadfbb595fe52b
python : 3.8.10.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.16.3-microsoft-standard-WSL2
Version : #1 SMP Fri Apr 2 22:23:49 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.2
numpy : 1.21.1
pytz : 2021.1
dateutil : 2.8.1
pip : 21.2.4
setuptools : 57.0.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.6.2
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None
Issue Analytics
- Created 2 years ago
- Comments: 10 (6 by maintainers)
@drtoche
Fixed by #43174. A 1.4.0 release candidate is expected next month, with the release in early January. Please feel free to evaluate the release candidate when it is available, or check out the nightly builds to confirm that this fixes the issue.
https://anaconda.org/scipy-wheels-nightly/pandas/files
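To confirm whether an installed build contains the fix, the reproduction from the report can be wrapped in a small probe. This is only a sketch: the function name is hypothetical, and it catches Exception broadly because the location of DataError differs between pandas versions.

```python
import pandas as pd


def float64_rolling_sum_works():
    """Hypothetical probe: rerun the reported example on this pandas build.

    Returns (True, None) if rolling(2).sum() succeeds on a Float64 series
    containing pd.NA, otherwise (False, repr of the raised exception).
    """
    s = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0]).astype("Float64").diff()
    try:
        s.rolling(2).sum()
    except Exception as err:  # pandas 1.3.2 raised DataError here
        return False, repr(err)
    return True, None


ok, error = float64_rolling_sum_works()
print(pd.__version__, "ok" if ok else error)
```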
Hi,
I think this bug still exists (in pandas 1.5.1), but for .ewm. See example below.