Calculating for high variance when the mean of cv_scores is zero

Right now, our calculation for high variance is:

    cv_scores_std = cv_scores.std()
    cv_scores_mean = cv_scores.mean()
    if cv_scores_std != 0 and cv_scores_mean != 0:
        high_variance_cv = bool(abs(cv_scores_std / cv_scores_mean) > threshold)

This was put in place to prevent a divide-by-zero warning / error. However, this implementation is not ideal: if the mean is close to zero (but not exactly zero), dividing by such a small number inflates the ratio, so high variance will likely be reported as true even when the scores are tightly grouped.

In addition, it is possible for there to be high variance if the mean is zero but the standard deviation is very high. Right now, we default to False in that case, but we may want to look for an implementation that is smarter in this situation.
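
To make the two failure modes concrete, here is a minimal sketch that wraps the check above in a function and runs it on a few score sets. The function name `high_variance_current`, the threshold value of 0.2, and the sample scores are all assumptions for illustration. It also uses NumPy arrays, whose `std()` defaults to `ddof=0`; if `cv_scores` is a pandas Series in the real code, `std()` defaults to `ddof=1` and the numbers would differ slightly.

```python
import numpy as np


def high_variance_current(cv_scores, threshold=0.2):
    """Mirror of the check quoted above (threshold=0.2 is an assumed value)."""
    cv_scores = np.asarray(cv_scores, dtype=float)
    cv_scores_std = cv_scores.std()
    cv_scores_mean = cv_scores.mean()
    # Default used when either statistic is exactly zero.
    high_variance_cv = False
    if cv_scores_std != 0 and cv_scores_mean != 0:
        high_variance_cv = bool(abs(cv_scores_std / cv_scores_mean) > threshold)
    return high_variance_cv


# Scores that differ by a few hundredths but whose mean sits near zero:
# the tiny denominator inflates the ratio, so the check fires.
print(high_variance_current([0.02, -0.01, 0.005]))

# Widely spread scores whose mean is exactly zero: the check is skipped
# and the result silently defaults to False.
print(high_variance_current([-5.0, 0.0, 5.0]))

# Tightly grouped scores far from zero, for comparison.
print(high_variance_current([0.8, 0.81, 0.79]))
```

The first two calls are the cases described above: one flags scores that are close together in absolute terms, the other misses a spread of ten units.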

Issue Analytics

Created 2 years ago
Comments: 9 (7 by maintainers)

Top GitHub Comments

2 reactions

freddyaboulton commented, Apr 8, 2021

@angela97lin Thanks for explaining! The problem of dividing by zero only happens with the objectives that can take on negative values. For those objectives, I wonder if our use of the coefficient of variation is even valid?

I looked at the wiki, and the definition section mentions that it should only be calculated for data on a ratio scale, i.e. data measured on a scale with a “meaningful zero”. I take that to mean that the data can’t be negative. This stats stackexchange post seems to support that interpretation.
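
One way to see why a "meaningful zero" matters: shifting every score by a constant leaves the spread unchanged but moves the mean, so the coefficient of variation changes even though the variability has not. The sketch below is only an illustration; the helper name `coefficient_of_variation` and the sample values are assumptions.

```python
import numpy as np


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (NumPy default ddof=0)."""
    values = np.asarray(values, dtype=float)
    return values.std() / values.mean()


scores = np.array([0.5, 1.0, 1.5])
shifted = scores - 0.9  # same spread, mean moved from 1.0 to about 0.1

# The standard deviations match (up to floating-point rounding) ...
print(scores.std(), shifted.std())

# ... but the ratio grows roughly tenfold because the mean is ten times smaller.
print(coefficient_of_variation(scores), coefficient_of_variation(shifted))
```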

With that in mind, I guess I see four options:

1. If the mean is <=0, don’t calculate the coefficient of variation. We would never raise the warning in this situation.
2. Add an epsilon to the denominator. I guess it depends on the value of epsilon but basically we would always raise the warning if the mean is zero.
3. Don’t change anything. We would never raise the warning when the mean == 0.
4. Find a different measure of overfitting that accounts for objectives that can take on negative values.
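
For comparison, the first two options could be sketched roughly as follows. This is not code from the project: the function names, the default `threshold=0.2`, and `eps=1e-8` are hypothetical, and NumPy's `std()` (with `ddof=0`) stands in for whatever `cv_scores.std()` computes in the real code.

```python
import numpy as np


def high_variance_skip_nonpositive(cv_scores, threshold=0.2):
    """Option 1 (sketch): skip the coefficient of variation when mean <= 0."""
    scores = np.asarray(cv_scores, dtype=float)
    mean = scores.mean()
    if mean <= 0:
        # Never raise the warning for a zero or negative mean.
        return False
    return bool(scores.std() / mean > threshold)


def high_variance_with_epsilon(cv_scores, threshold=0.2, eps=1e-8):
    """Option 2 (sketch): add a small epsilon to the denominator's magnitude."""
    scores = np.asarray(cv_scores, dtype=float)
    return bool(scores.std() / (abs(scores.mean()) + eps) > threshold)


zero_mean_scores = [-5.0, 0.0, 5.0]
print(high_variance_skip_nonpositive(zero_mean_scores))  # option 1 stays silent
print(high_variance_with_epsilon(zero_mean_scores))      # option 2 fires
```

With a tiny epsilon, option 2 fires whenever the mean is zero and the scores are not all identical, which matches the "always raise the warning" behaviour described in the list. A larger epsilon would soften that, at the cost of introducing another value to tune.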

What do you think?

1 reaction

angela97lin commented, Apr 8, 2021

@freddyaboulton Ah sorry, the order of the PRs is a bit confusing. You’re right, #2024 handled the case where the mean is zero so that we don’t trigger the divide-by-zero issue. What I wanted this issue to track was a better way to calculate / handle the case when the mean is zero, since right now we’re defaulting to False. Will update the description :d
