Hyperband stopping criterion too permissive with noisy metrics
See original GitHub issue
wandb --version && python --version && uname
- Weights and Biases version: 0.9.7
- Python version: 3.7.9
- Operating System: Ubuntu 18.04 LTS
Description
Running a sweep with the Hyperband (HB) early stopping criterion. The metric I use to evaluate model quality is somewhat noisy. The current Hyperband stopping criterion compares the run’s minimum (best) metric value (code) against a single-sample snapshot of the other runs in that band (code). For noisy metrics, the result is that runs are judged far too favorably: runs that a human would obviously judge as worse than average are allowed to continue. The problem becomes especially pronounced in later bands, where the min() is taken over a larger number of samples.
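The asymmetry can be illustrated with a minimal sketch (my own illustration, not wandb’s actual code): under pure noise with no real signal, the running min over a band’s worth of samples is biased low, while a single snapshot is unbiased, so comparing one against the other is unfair by construction.

```python
# Hypothetical illustration of the bias: with pure Gaussian noise, the
# running min over k readings drifts well below a single-sample snapshot.
import numpy as np

rng = np.random.default_rng(0)
samples = rng.standard_normal((100_000, 16))  # 16 noisy metric readings per run

single = samples[:, -1]            # one snapshot per run (unbiased)
running_min = samples.min(axis=1)  # what the current criterion judges a run by

print(single.mean())       # ~0: a single sample is unbiased
print(running_min.mean())  # ~-1.77: min over 16 samples is strongly biased low
```

The gap grows with the number of samples in the band, matching the observation that the problem is worst in later bands.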
What I Did
Consider these charts of the objective metric vs time for a HB run.
The run I’m concerned about is in hot pink, with the badly drawn label “This V”. It passed a band marker at about 3.7 hours where the two brownish runs at the bottom finished. This smoothed plot makes it clear that the hot pink run is not in the better half of all runs to reach that point, and thus should be stopped. But because it got one lucky sample below the median, it is allowed to continue, as shown in this unsmoothed plot:
Discussion
It’s not totally clear what to do about this. But the current criterion is clearly biased in a way that wastes compute: it stops far fewer runs than it should when the metrics are noisy. Using single-sample snapshots for both the current run and the thresholds would, I think, yield an unbiased criterion, but with high variance. Alternatively, using the same aggregation for both sides would also eliminate the bias. I’m not sure which would do better in general.
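A quick Monte Carlo (a sketch under my own simplified assumptions, not the library’s implementation) shows how far the current min-vs-snapshot comparison falls short of the intended 1/eta stopping fraction when all runs are statistically identical noise:

```python
# Hypothetical simulation: 1000 identical pure-noise runs. A fair criterion
# should stop about half of them; min-vs-snapshot stops almost none.
import numpy as np

rng = np.random.default_rng(1)
m = rng.standard_normal((1000, 8))  # 8 noisy readings per run, no real signal

snapshot = m[:, -1]                   # latest single reading per run
run_min = m.min(axis=1)               # running best per run
threshold = np.median(snapshot)       # band threshold built from snapshots

frac_stopped_current = (run_min > threshold).mean()   # biased: min vs snapshot
frac_stopped_fair = (snapshot > threshold).mean()     # fair: like vs like

print(frac_stopped_current)  # close to 0
print(frac_stopped_fair)     # close to 0.5
```

Since each run’s min over 8 samples almost never exceeds the median snapshot, the biased criterion stops nearly nothing, while the like-vs-like comparison stops roughly half, as Hyperband intends.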
Issue Analytics
- Created: 3 years ago
- Reactions: 1
- Comments: 12 (2 by maintainers)
Top GitHub Comments
Sorry - I totally missed this and it really should compare the snapshot - let me take a look.
The mean or a smoothed option seems like a good idea, although maybe simpler and more flexible is to let the user do the aggregation. Thanks @leopd.
Sorry - I realize a mistake I made in the original bug report. I shouldn’t have referred to human judgement, because that implies that the algorithm is doing something silly that an expert wouldn’t - this isn’t the problem. The algorithm as implemented is objectively incorrect.
At its core, the problem is that it’s making an unfair comparison: a single metric value vs the min of a series of metrics. The bias means it fails to stop half (or 1/eta) of the jobs, as it should. The question is how to fix it. Compare a single value against a single value at the instant the run reaches the new band? That would be fair and simple, but also very noisy. It achieves the key goal of stopping half the runs, so all the very long proofs in the paper would apply, but the variance would be very high, meaning convergence to a good configuration would be slow. A lower-variance alternative is to aggregate the metrics within the band in the same way for every run. Min vs min could work, but that will also have high variance (higher?), since it emphasizes outliers.
I think the mean is probably better, since it will give the lowest variance assuming normal noise. Of course these samples are expected to show some systematic progression within the band towards better metric values; still, the mean reduces variance and is as fair a comparison between runs as anything. By “the metrics within the band” I mean just the new metrics since the end of the last band. So if your band boundaries are {2, 4, 8}, then when aggregating metrics for band=8, you’d use the metrics at times 5 through 8. There’s no reason to include the earlier history, since the runs already passed that criterion.
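The per-band windowing described above can be sketched as follows. This is a hypothetical helper of my own, not an existing wandb function; it assumes metric history is a simple list indexed by step:

```python
# Hypothetical sketch of the proposed aggregation: for each band boundary,
# average only the metrics recorded since the previous boundary.
def band_mean(history, boundaries, band):
    """history: metric values for steps 1..len(history).
    boundaries: sorted band boundaries, e.g. [2, 4, 8].
    band: the boundary being evaluated (must appear in boundaries)."""
    i = boundaries.index(band)
    start = boundaries[i - 1] if i > 0 else 0  # end of the previous band
    window = history[start:band]               # steps start+1 .. band
    return sum(window) / len(window)

history = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]  # steps 1..8
print(band_mean(history, [2, 4, 8], 8))  # averages steps 5..8 only
```

With boundaries {2, 4, 8}, the band=8 aggregate uses only steps 5 through 8, matching the suggestion to exclude history that already passed an earlier criterion.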
Here’s another example where the current implementation fails badly, and it applies even if the objective metric has zero noise: suppose the training processes start to overfit, so that the metrics of all runs bottom out and then go back up (get worse). In this situation, the current algorithm would never stop a large class of overfitting runs, because it always considers the best metric value a run has ever achieved and compares it against the larger, worsening metrics of all the other runs (including itself) in the later bands. The minimum metric value will be better than anything any run produces later, so the current implementation judges every run favorably and never stops any of them.
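This failure mode is easy to reproduce in a toy, zero-noise setting (again my own sketch, not the library’s code): give every run the same U-shaped loss curve plus a constant offset, and late in training no run’s all-time best ever exceeds the snapshot-based threshold.

```python
# Hypothetical zero-noise overfitting scenario: metrics improve, bottom out,
# then worsen. Judging each run by its all-time best against a snapshot of
# current (worsening) values means nothing is ever stopped.
curve = [1.0, 0.6, 0.3, 0.2, 0.3, 0.5, 0.8, 1.1]             # shared U-shape
runs = [[v + off for v in curve] for off in (0.0, 0.1, 0.2)]  # 3 runs, offsets

step = 7  # late in training, well past the minimum at step 4
best_so_far = [min(r[:step + 1]) for r in runs]  # what each run is judged by
snapshot_now = [r[step] for r in runs]           # thresholds come from these
threshold = sorted(snapshot_now)[len(snapshot_now) // 2]  # median snapshot

stopped = [b > threshold for b in best_so_far]
print(stopped)  # [False, False, False] -- even the worst run survives
```

Every run’s historical minimum (around 0.2–0.4 here) sits far below the current snapshot values (around 1.1–1.3), so even the clearly worst run is judged favorably forever.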