Hyperband stopping criterion too permissive with noisy metrics
See original GitHub issue
wandb --version && python --version && uname
- Weights and Biases version: 0.9.7
- Python version: 3.7.9
- Operating System: Ubuntu 18.04 LTS
Description
Running a sweep with the Hyperband (HB) early stopping criterion. The metric I use to evaluate model quality is somewhat noisy. The current Hyperband stopping criterion compares the run’s minimum (best) metric value (code) against a single-sample snapshot of the other runs in that band (code). For noisy metrics, the result is that runs are judged far too favorably: runs that a human would obviously judge as worse than average are allowed to continue. The problem becomes especially pronounced in later bands, where the min() is taken over a larger number of samples.
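The asymmetry can be illustrated with a minimal sketch (my own illustration, not wandb’s actual code): under pure noise with no real signal, the running min over a band’s worth of samples is biased low, while a single snapshot is unbiased, so comparing one against the other is unfair by construction.

```python
# Hypothetical illustration of the bias: with pure Gaussian noise, the
# running min over k readings drifts well below a single-sample snapshot.
import numpy as np

rng = np.random.default_rng(0)
samples = rng.standard_normal((100_000, 16))  # 16 noisy metric readings per run

single = samples[:, -1]            # one snapshot per run (unbiased)
running_min = samples.min(axis=1)  # what the current criterion judges a run by

print(single.mean())       # ~0: a single sample is unbiased
print(running_min.mean())  # ~-1.77: min over 16 samples is strongly biased low
```

The gap grows with the number of samples in the band, matching the observation that the problem is worst in later bands.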
What I Did
Consider these charts of the objective metric vs time for a HB run.
The run I’m concerned about is in hot pink, with the badly drawn label “This V”. It passed a band marker at about 3.7 hours where the two brownish runs at the bottom finished. This smoothed plot makes it clear that the hot pink run is not in the better half of all runs to reach that point, and thus should be stopped. But because it got one lucky sample below the median, it is allowed to continue, as shown in this unsmoothed plot:
Discussion
It’s not totally clear what to do about this. But the current criterion is clearly biased in a way that wastes compute: it stops far fewer runs than it should when the metrics are noisy. Using single-sample snapshots for both the current run and the thresholds would, I think, yield an unbiased criterion, but with high variance. Alternatively, using the same aggregation for both sides would also eliminate the bias. I’m not sure which would do better in general.
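A quick Monte Carlo (a sketch under my own simplified assumptions, not the library’s implementation) shows how far the current min-vs-snapshot comparison falls short of the intended 1/eta stopping fraction when all runs are statistically identical noise:

```python
# Hypothetical simulation: 1000 identical pure-noise runs. A fair criterion
# should stop about half of them; min-vs-snapshot stops almost none.
import numpy as np

rng = np.random.default_rng(1)
m = rng.standard_normal((1000, 8))  # 8 noisy readings per run, no real signal

snapshot = m[:, -1]                   # latest single reading per run
run_min = m.min(axis=1)               # running best per run
threshold = np.median(snapshot)       # band threshold built from snapshots

frac_stopped_current = (run_min > threshold).mean()   # biased: min vs snapshot
frac_stopped_fair = (snapshot > threshold).mean()     # fair: like vs like

print(frac_stopped_current)  # close to 0
print(frac_stopped_fair)     # close to 0.5
```

Since each run’s min over 8 samples almost never exceeds the median snapshot, the biased criterion stops nearly nothing, while the like-vs-like comparison stops roughly half, as Hyperband intends.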
Issue Analytics
- Created: 3 years ago
- Reactions: 1
- Comments: 12 (2 by maintainers)
Top GitHub Comments
Sorry - I totally missed this and it really should compare the snapshot - let me take a look.
The mean or a smoothed option seems like a good idea, although maybe simpler and more flexible is to let the user do the aggregation. Thanks @leopd.
Sorry - I realize a mistake I made in the original bug report. I shouldn’t have referred to human judgement, because that implies that the algorithm is doing something silly that an expert wouldn’t - this isn’t the problem. The algorithm as implemented is objectively incorrect.
At its core, the problem is that it’s making an unfair comparison: a single metric value vs the min of a series of metrics. The bias means it fails to stop half (or 1/eta) of the jobs, as it should. The question is how to fix it. Compare a single value against a single value at the instant the run reaches the new band? That would be fair and simple, but also very noisy. It achieves the key goal of stopping half the runs, so all the very long proofs in the paper would apply, but the variance would be very high, meaning convergence to a good configuration would be slow. A lower-variance alternative is to aggregate the metrics within the band in the same way for every run. Min vs min could work, but that will also have high variance (higher?), since it emphasizes outliers.
I think the mean is probably better, since it will give the lowest variance assuming normal noise. Of course these samples are expected to show some systematic progression within the band towards better metric values; still, the mean reduces variance and is as fair a comparison between runs as anything. By “the metrics within the band” I mean just the new metrics since the end of the last band. So if your band boundaries are {2, 4, 8}, then when aggregating metrics for band=8, you’d use the metrics at times 5 through 8. There’s no reason to include the earlier history, since the runs already passed that criterion.
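The per-band windowing described above can be sketched as follows. This is a hypothetical helper of my own, not an existing wandb function; it assumes metric history is a simple list indexed by step:

```python
# Hypothetical sketch of the proposed aggregation: for each band boundary,
# average only the metrics recorded since the previous boundary.
def band_mean(history, boundaries, band):
    """history: metric values for steps 1..len(history).
    boundaries: sorted band boundaries, e.g. [2, 4, 8].
    band: the boundary being evaluated (must appear in boundaries)."""
    i = boundaries.index(band)
    start = boundaries[i - 1] if i > 0 else 0  # end of the previous band
    window = history[start:band]               # steps start+1 .. band
    return sum(window) / len(window)

history = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]  # steps 1..8
print(band_mean(history, [2, 4, 8], 8))  # averages steps 5..8 only
```

With boundaries {2, 4, 8}, the band=8 aggregate uses only steps 5 through 8, matching the suggestion to exclude history that already passed an earlier criterion.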
Here’s another example where the current implementation fails badly, and it applies even if the objective metric has zero noise: suppose the training processes start to overfit, so that the metrics of all runs bottom out and then go back up (get worse). In this situation, the current algorithm would never stop a large class of overfitting runs, because it always considers the best metric value a run has ever achieved and compares it against the larger, worsening metrics of all the other runs (including itself) in the later bands. The minimum metric value will be better than anything any run produces later, so the current implementation judges every run favorably and never stops any of them.
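This failure mode is easy to reproduce in a toy, zero-noise setting (again my own sketch, not the library’s code): give every run the same U-shaped loss curve plus a constant offset, and late in training no run’s all-time best ever exceeds the snapshot-based threshold.

```python
# Hypothetical zero-noise overfitting scenario: metrics improve, bottom out,
# then worsen. Judging each run by its all-time best against a snapshot of
# current (worsening) values means nothing is ever stopped.
curve = [1.0, 0.6, 0.3, 0.2, 0.3, 0.5, 0.8, 1.1]             # shared U-shape
runs = [[v + off for v in curve] for off in (0.0, 0.1, 0.2)]  # 3 runs, offsets

step = 7  # late in training, well past the minimum at step 4
best_so_far = [min(r[:step + 1]) for r in runs]  # what each run is judged by
snapshot_now = [r[step] for r in runs]           # thresholds come from these
threshold = sorted(snapshot_now)[len(snapshot_now) // 2]  # median snapshot

stopped = [b > threshold for b in best_so_far]
print(stopped)  # [False, False, False] -- even the worst run survives
```

Every run’s historical minimum (around 0.2–0.4 here) sits far below the current snapshot values (around 1.1–1.3), so even the clearly worst run is judged favorably forever.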