BUG: Possible bug when using winsorize on pandas data instead of numpy data

Describe your issue.

When scipy.stats.mstats.winsorize is used with nan_policy='omit' on a NumPy array, it behaves as expected. However, when it is used on pandas data such as a Series, all NaNs are converted to the maximum possible value. I believe this may be a bug, though it could be treated as an enhancement if pandas data was never meant to be supported. An example can be seen in the image below:

image

Reproducing Code Example

import numpy as np
import pandas as pd
from scipy.stats.mstats import winsorize

# 10 NaNs, then the integers 0..99, then 10 more NaNs
nans_10 = [np.nan] * 10
data = np.array(nans_10 + list(range(100)) + nans_10)

# NumPy input: behaves as expected
print(f"Numpy array: {winsorize(data, (0.2, 0.2), nan_policy='omit')}")

# pandas input: the NaNs are unexpectedly replaced
print(f"Pandas series: {winsorize(pd.Series(data), (0.2, 0.2), nan_policy='omit')}")

Error message

No error/warning message is given. The behavior is just unexpected. The pandas version used was 1.4.2.
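
Since the NumPy path is reported to behave as expected, one possible workaround is to hand winsorize the underlying array and rebuild the Series afterwards. This is only a sketch and was not proposed in the issue itself; the helper name winsorize_series is hypothetical, and the exact winsorized values may differ between SciPy versions.

```python
import numpy as np
import pandas as pd
from scipy.stats.mstats import winsorize


def winsorize_series(ser, limits):
    """Hypothetical helper: winsorize a Series via its NumPy values."""
    # Pass a plain ndarray so SciPy's NaN detection sees the NaNs.
    result = winsorize(ser.to_numpy(dtype=float), limits, nan_policy="omit")
    # winsorize returns a masked array; convert it back and keep the index.
    return pd.Series(np.asarray(result), index=ser.index, name=ser.name)


nans_10 = [np.nan] * 10
ser = pd.Series(nans_10 + list(range(100)) + nans_10, dtype=float)

clipped = winsorize_series(ser, (0.2, 0.2))
print(clipped.isna().sum())  # number of NaNs left untouched
```

Converting with to_numpy() sidesteps the pandas-specific behavior entirely, at the cost of one extra copy of the data.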

SciPy/NumPy/Python version information

1.8.0 1.21.6 sys.version_info(major=3, minor=10, micro=2, releaselevel='final', serial=0)

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 8 (8 by maintainers)

Top GitHub Comments

mdhaber commented, May 30, 2022

The problem was that _contains_nan relies on the sum of the array to be nan if there are any NaNs. This is not true for Pandas types, which ignore NaNs by default. I’ll submit a PR.
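
The difference described in this comment can be reproduced without SciPy. The sketch below uses a hypothetical function, sum_based_nan_check, as a stand-in for a sum-based NaN test; it is not SciPy's actual _contains_nan code, only an illustration of why such a test misses NaNs in a pandas Series.

```python
import numpy as np
import pandas as pd

values = [1.0, np.nan, 3.0]
arr = np.array(values)
ser = pd.Series(values)

array_sum = np.sum(arr)   # NaN propagates through the ndarray sum
series_sum = np.sum(ser)  # dispatches to Series.sum, which skips NaN by default


def sum_based_nan_check(a):
    """Hypothetical stand-in: report NaNs only if the sum itself is NaN."""
    return bool(np.isnan(np.sum(a)))


print(sum_based_nan_check(arr))             # True
print(sum_based_nan_check(ser))             # False, the NaN goes unnoticed
print(sum_based_nan_check(ser.to_numpy()))  # True once converted to an ndarray
```

A check built on np.isnan(a).any() would not depend on how the container implements sum, which is one way such a test could be made robust to pandas input.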

mdhaber commented, May 18, 2022

<facepalm>
