Calculating for high variance when the mean of cv_scores is zero


Right now, our calculation for high variance is:

        cv_scores_std = cv_scores.std()
        cv_scores_mean = cv_scores.mean()
        if cv_scores_std != 0 and cv_scores_mean != 0:
            high_variance_cv = bool(abs(cv_scores_std / cv_scores_mean) > threshold)

This was put in place to prevent a divide-by-zero warning or error. However, this implementation is not ideal: if the mean is close to zero (but not zero), the check will likely report high variance, because dividing by a small number inflates the ratio.

In addition, it is possible for there to be high variance if the mean is zero but the standard deviation is very high. Right now, we default to False, but we may want to look for an implementation that is smarter in this situation.
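
To make the two cases above concrete, here is a minimal sketch that mirrors the quoted check using only the standard library. The function name `high_variance_current`, the default `threshold=0.2`, and the sample scores are assumptions made for illustration; `statistics.stdev` stands in for the `.std()` call in the snippet and may not match it exactly.

```python
import statistics


def high_variance_current(cv_scores, threshold=0.2):
    """Hypothetical stand-in for the check quoted above (not the project's code)."""
    cv_scores_std = statistics.stdev(cv_scores)  # sample standard deviation
    cv_scores_mean = statistics.mean(cv_scores)
    if cv_scores_std != 0 and cv_scores_mean != 0:
        return bool(abs(cv_scores_std / cv_scores_mean) > threshold)
    return False  # the default described above


# Case 1: mean close to zero. A tiny spread still trips the check,
# because the ratio std / mean is inflated by the small denominator.
near_zero = [0.001, -0.001, 0.002]

# Case 2: mean exactly zero with a large spread. The guard skips the
# ratio entirely, so the check reports False.
zero_mean = [-10.0, 0.0, 10.0]

print(high_variance_current(near_zero))
print(high_variance_current(zero_mean))
```

With these made-up scores, the first call reports high variance even though the scores barely differ in absolute terms, while the second does not, even though the scores are spread widely.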

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 9 (7 by maintainers)

Top GitHub Comments

2 reactions
freddyaboulton commented, Apr 8, 2021

@angela97lin Thanks for explaining! The problem of dividing by zero only happens with the objectives that can take on negative values. For those objectives, I wonder if our use of the coefficient of variation is even valid?

I looked at the wiki, and the definition section mentions that it should only be calculated for data on a ratio scale, i.e. data measured on a scale with a “meaningful zero”. I take that to mean that the data can’t be negative. This stats stackexchange post seems to support that interpretation.

With that in mind, I guess I see four options:

  1. If the mean is <=0, don’t calculate the coefficient of variation. We would never raise the warning in this situation.
  2. Add an epsilon to the denominator. I guess it depends on the value of epsilon but basically we would always raise the warning if the mean is zero.
  3. Don’t change anything. We would never raise the warning when the mean == 0.
  4. Find a different measure of overfitting that accounts for objectives that can take on negative values.
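
As a rough sketch of how options 1 and 2 would behave differently, assuming a default `threshold=0.2` and an `epsilon` of `1e-10` (both illustrative values), and using hypothetical function names that are not part of any library:

```python
import statistics


def high_variance_option_1(cv_scores, threshold=0.2):
    """Option 1 (sketch): skip the coefficient of variation when mean <= 0."""
    mean = statistics.mean(cv_scores)
    if mean <= 0:
        return False  # never warn for non-positive means
    return statistics.stdev(cv_scores) / mean > threshold


def high_variance_option_2(cv_scores, threshold=0.2, epsilon=1e-10):
    """Option 2 (sketch): add an epsilon to the denominator.

    Taking abs(mean) before adding epsilon is an assumption made here so the
    denominator can never be zero, even for negative means.
    """
    std = statistics.stdev(cv_scores)
    mean = statistics.mean(cv_scores)
    return std / (abs(mean) + epsilon) > threshold


zero_mean = [-10.0, 0.0, 10.0]   # mean is exactly zero, spread is large
stable = [0.80, 0.82, 0.81]      # positive mean, small spread

print(high_variance_option_1(zero_mean), high_variance_option_2(zero_mean))
print(high_variance_option_1(stable), high_variance_option_2(stable))
```

For the zero-mean scores, option 1 stays silent while option 2 raises the warning, since any nonzero standard deviation divided by epsilon is enormous. Both agree on the stable scores.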

What do you think?

1 reaction
angela97lin commented, Apr 8, 2021

@freddyaboulton Ah sorry, the order of the PRs is a bit confusing. You’re right, #2024 handled the case where the mean is zero so that we don’t trigger the divide-by-zero issue. What I wanted this issue to track was a better way to calculate / handle the case when the mean is zero, since right now we’re defaulting to False. Will update the description :d
