Calculating for high variance when the mean of cv_scores is zero
Right now, our calculation for high variance is:
```python
cv_scores_std = cv_scores.std()
cv_scores_mean = cv_scores.mean()
if cv_scores_std != 0 and cv_scores_mean != 0:
    high_variance_cv = bool(abs(cv_scores_std / cv_scores_mean) > threshold)
```
This was put in place to prevent a divide-by-zero warning / error. However, this implementation is not ideal: if the mean is close to zero (but not exactly zero), we divide by a very small number, so the ratio blows up and high variance will likely be reported as true even when the spread of the scores is small.
In addition, it is possible for there to be high variance when the mean is zero but the standard deviation is very high. Right now, we default to False in that case, but we may want to look for an implementation that is smarter in this situation.
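Both failure modes can be reproduced with a small sketch. `high_variance_current` mirrors the check quoted above. `high_variance_floored` is a hypothetical alternative that floors the denominator and is not something the project has adopted. The function names, the `threshold=0.2` default, the `min_scale` parameter, and the sample scores are all assumptions made for illustration.

```python
import numpy as np


def high_variance_current(cv_scores, threshold=0.2):
    """Mirror of the check quoted above; stays False when std or mean is zero."""
    cv_scores = np.asarray(cv_scores, dtype=float)
    cv_scores_std = cv_scores.std()
    cv_scores_mean = cv_scores.mean()
    high_variance_cv = False
    if cv_scores_std != 0 and cv_scores_mean != 0:
        high_variance_cv = bool(abs(cv_scores_std / cv_scores_mean) > threshold)
    return high_variance_cv


def high_variance_floored(cv_scores, threshold=0.2, min_scale=1e-2):
    """Hypothetical alternative: never divide by less than min_scale.

    min_scale is an assumed tuning knob, not a value from the issue.
    """
    cv_scores = np.asarray(cv_scores, dtype=float)
    scale = max(abs(cv_scores.mean()), min_scale)
    return bool(cv_scores.std() / scale > threshold)


# Tiny spread, mean close to zero: the current check flags high variance.
near_zero_mean = [0.001, 0.002, 0.003]
# Large spread, mean exactly zero: the current check defaults to False.
zero_mean = [-10.0, 0.0, 10.0]

for name, scores in [("near-zero mean", near_zero_mean), ("zero mean", zero_mean)]:
    print(name, high_variance_current(scores), high_variance_floored(scores))
```

With these sample inputs the two functions disagree in both cases, which is exactly the behavior described above. Whether a fixed floor is appropriate depends on the scale of the objective, so this is only one possible direction.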
Issue Analytics
- State:
- Created 2 years ago
- Comments: 9 (7 by maintainers)
Top GitHub Comments
@angela97lin Thanks for explaining! The problem of dividing by zero only happens with the objectives that can take on negative values. For those objectives, I wonder if our use of the coefficient of variation is even valid?
I looked at the wiki, and the definition section mentions that it should only be calculated for data on a ratio scale, i.e. data measured on a scale with a “meaningful zero”. I take that to mean that the data can’t be negative. This stats stackexchange post seems to support that interpretation.
With that in mind, I guess I see four options:
What do you think?
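To make the ratio-scale concern concrete, here is a minimal sketch with made-up numbers (none of them come from the issue). It shifts the same set of scores by a constant: the standard deviation is unchanged, but the coefficient of variation changes drastically and even turns negative. The helper name `coefficient_of_variation` is hypothetical.

```python
import numpy as np


def coefficient_of_variation(values):
    """Plain std / mean ratio, with no guard against a zero mean."""
    values = np.asarray(values, dtype=float)
    return values.std() / values.mean()


# Illustrative scores only; the spread is identical in all three arrays.
base = np.array([0.8, 0.9, 1.0, 1.1, 1.2])  # mean 1.0
shifted = base - 0.95                        # mean close to zero
negative = base - 2.0                        # mean below zero

cv_base = coefficient_of_variation(base)
cv_shifted = coefficient_of_variation(shifted)
cv_negative = coefficient_of_variation(negative)

print(base.std(), shifted.std(), negative.std())
print(cv_base, cv_shifted, cv_negative)
```

Because the ratio depends on where zero sits, it is hard to interpret for objectives whose scores can be shifted or negative.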
@freddyaboulton Ah sorry, the order of the PRs is a bit confusing. You’re right, #2024 handled the case where the mean is zero so that we don’t trigger the divide-by-zero issue. What I wanted this issue to track was a better way to calculate / handle high variance when the mean is zero, since right now we’re defaulting to False. Will update the description :d