BUG: Possible bug when using winsorize on pandas data instead of numpy data
Describe your issue.
When scipy.stats.mstats.winsorize is used with a nan_policy of omit on a NumPy array, it behaves as expected. However, when it is used on pandas data such as a Series, all NaNs are converted to the maximum possible value. I believe this may be a bug, but it could instead be an enhancement request if pandas data was never meant to be supported. The code below reproduces the behavior:
Reproducing Code Example
import numpy as np
import pandas as pd
from scipy.stats.mstats import winsorize

# 10 NaNs, then the values 0..99, then 10 more NaNs
nans_10 = list(np.repeat(np.nan, 10))
data = np.array(
    nans_10
    + list(range(100))
    + nans_10
)

print(f"Numpy array: {winsorize(data, (0.2, 0.2), nan_policy='omit')}")
print(f"Pandas series: {winsorize(pd.Series(data), (0.2, 0.2), nan_policy='omit')}")
Error message
No error/warning message is given. The behavior is just unexpected. The pandas version used was 1.4.2.
SciPy/NumPy/Python version information
1.8.0 1.21.6 sys.version_info(major=3, minor=10, micro=2, releaselevel='final', serial=0)
The problem was that _contains_nan relies on the sum of the array to be nan if there are any NaNs. This is not true for pandas types, which ignore NaNs by default. I'll submit a PR. <facepalm>
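
The difference can be seen without SciPy at all. The sketch below is not the actual _contains_nan source: sum_based_nan_check is a hypothetical stand-in for a check that sums the input and tests whether the result is NaN, and robust_nan_check is one possible alternative, assuming the input can be converted to a float array.

```python
import numpy as np
import pandas as pd


def sum_based_nan_check(a):
    """Hypothetical stand-in for a sum-based NaN check (not SciPy's code)."""
    # Relies on NaN propagating through the sum of the input.
    return bool(np.isnan(np.sum(a)))


def robust_nan_check(a):
    """Convert to a float ndarray first so the result does not depend on the container."""
    return bool(np.isnan(np.asarray(a, dtype=float)).any())


values = [np.nan, 1.0, 2.0, 3.0]
arr = np.array(values)
ser = pd.Series(values)

print(arr.sum())  # nan: NumPy propagates NaN through the sum
print(ser.sum())  # 6.0: pandas skips NaN by default (skipna=True)

print(sum_based_nan_check(arr), sum_based_nan_check(ser))  # True False
print(robust_nan_check(arr), robust_nan_check(ser))        # True True
```

Since the report above says NumPy input behaves as expected, converting with Series.to_numpy() before calling winsorize looks like a reasonable workaround on affected versions, though that is an inference from this thread rather than documented guidance.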