[BUG] - numpy overflow encountered in reduce
Thanks for sharing this package, I’m loving it!
I did run into a bug today. When I try to run dist_plot on my dataset, I get the following message:
<snip>\numpy\core\_methods.py:160: RuntimeWarning: overflow encountered in reduce
I isolated it down to one particular series in my dataframe. It’s not one I really care about, but maybe someone else will run into it for a series they DO care about. Here’s a describe() after running it through klib’s data_cleaning function:
df.created_at.describe()
count 5.213400e+04
mean 1.610795e+12
std 4.225043e+08
min 1.609891e+12
25% 1.610552e+12
50% 1.610838e+12
75% 1.611198e+12
max 1.611274e+12
Name: created_at, dtype: float64
Meanwhile, info() reports something different:
df.info()
…
2 created_at 52134 non-null float32
…
Notice one reports float32 while the other says float64… Seems fishy.
I’m using miniconda on Windows 10. conda v4.9.2 numpy v1.19.5 klib v0.1.0
If you need me to provide my dataset, I can do so.
@akanz1 yep that makes sense. You’re right, the analysis of this particular variable is not interesting, so the overflow doesn’t bother me. But who knows, maybe some day someone will run into this with data that IS interesting. 😀 Thanks for taking a look.
@Zalfrin thanks for the data. I was able to reproduce the issue and narrow it down to the computation of the kurtosis using scipy.
scipy.stats.kurtosis(df_cleaned)
If you check the plot created with klib.dist_plot(df_cleaned), you should notice that the kurtosis becomes infinite. I was not able to identify exactly why the kurtosis calculation raises a RuntimeWarning for df_cleaned but not for the original df.
Given the overflow warning, I suspect that a 32-bit float is not large enough to hold the intermediate results of the kurtosis computation (see the scipy source). Kurtosis is based on the fourth central moment, so the already large values derived from your UNIX timestamps are raised to the fourth power and then summed.
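For anyone who wants to see the effect without the original dataset, here is a minimal sketch. It uses synthetic values drawn uniformly between the min and max from the describe() output above (an assumption, not the real data), and a hypothetical helper fourth_power_sum that mimics only the fourth-moment step in plain numpy. It is not scipy's actual implementation.

```python
import numpy as np

# Synthetic stand-in for the reported column (assumption: uniform between
# the min and max shown in describe(); the real distribution is unknown).
rng = np.random.default_rng(0)
ts64 = rng.uniform(1.609891e12, 1.611274e12, size=52_134)
ts32 = ts64.astype(np.float32)  # same values stored as 32-bit floats


def fourth_power_sum(values):
    """Sum of fourth powers of deviations from the mean, in the input dtype."""
    dev = values - values.mean()
    return (dev ** 4).sum()


# Silence the RuntimeWarning here so the result can be inspected directly.
with np.errstate(over="ignore"):
    s32 = fourth_power_sum(ts32)
s64 = fourth_power_sum(ts64)

print("float32 sum is infinite:", np.isinf(s32))
print("float64 sum is finite:  ", np.isfinite(s64))
print("float32 max:", np.finfo(np.float32).max)
```

Each individual fourth power still fits in a float32, but adding up tens of thousands of them exceeds the float32 maximum of roughly 3.4e38, which would be consistent with the warning being raised inside a reduce (sum) call.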
Ideally, convert your timestamps to a datetime using
datetime.fromtimestamp(timestamp)
to avoid the overflow. Or simply ignore the warning, since the kurtosis likely does not add much value in this situation. I hope this helps!
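A small sketch of that conversion, with one caveat: values around 1.6e12 look like milliseconds since the epoch rather than seconds (this is an assumption based on their magnitude), and datetime.fromtimestamp expects seconds, so the value is divided by 1000 first. The helper name ms_to_datetime is made up for illustration.

```python
from datetime import datetime, timezone


def ms_to_datetime(ms):
    """Convert a millisecond UNIX timestamp to a timezone-aware UTC datetime."""
    # fromtimestamp expects seconds, so scale the millisecond value down.
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


# The mean from the describe() output above, assumed to be in milliseconds.
dt = ms_to_datetime(1.610795e12)
print(dt.isoformat())
```

Passing the raw millisecond value straight to datetime.fromtimestamp would be interpreted as seconds and land tens of thousands of years in the future (or raise an error on some platforms), so it is worth checking the unit before converting.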