Trialling hyper-parameter optimization
@gbaydin This is a question about something I thought might “just work”. Don’t spend time on it; I’m just curious whether there’s something obvious I’m doing wrong.
I made a tentative attempt at hyper-parameter optimization of the learning rate for the first 20 minibatches of the GAN sample:
https://github.com/DiffSharp/DiffSharp/compare/examples/vae-gan-hypopt?expand=1
gan.fsx: https://github.com/dsyme/DiffSharp/blob/ef0bcd04575a67636b5557dcc953c4ab8e287598/examples/gan.fsx
The aim is simply to work out what the optimal training learning rate would be if we’re only going to run training on precisely those same 20 minibatches. However, my attempt doesn’t work because the derivative of my `train` function is always zero according to the optimizer, e.g. my `printfn` addition to the SGD optimizer gives this:

f = tensor(16.0946), g = tensor(0.)

Here `f` is the sum of the generator’s losses for the first 20 minibatches (the result of `train`) and `g` is the gradient of the `train` function as reported to SGD. At first I thought this might have been due to `noDiff` erasing all derivatives. However, switching to a `stripDiff` that just takes the primal didn’t change things.
Anyway, if you can spot anything simple I’m doing wrong on a quick glance, it would be instructive for me.
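For context, the shape of the computation I’m trying to differentiate is roughly as follows. This is just a minimal sketch with a hypothetical stand-in model and loss rather than the real gan.fsx code; it assumes DiffSharp’s `dsharp.grad`, `dsharp.randn` and `Tensor` arithmetic, and that nested `dsharp.grad` calls propagate derivatives as intended. The point is that `train` takes the learning rate as a tensor and the weight updates have to stay on the derivative tape:

```fsharp
open DiffSharp

// Hypothetical stand-in for the GAN generator: a linear model with a
// squared-error loss over 20 fixed random minibatches.
let minibatches = [ for _ in 1 .. 20 -> dsharp.randn([8; 4]), dsharp.randn([8; 1]) ]

let loss (w: Tensor) ((x, y): Tensor * Tensor) =
    let d = x.matmul(w) - y
    (d * d).mean()

// train : Tensor -> Tensor
// Maps a candidate learning rate to the summed loss over the 20 minibatches.
// For d(train)/d(lr) to be non-zero, the updated weights must stay on the
// derivative tape: detaching them (noDiff / taking primals) after each update
// cuts the only path from lr to the later losses.
let train (lr: Tensor) =
    let mutable w = dsharp.randn([4; 1])
    let mutable total = dsharp.tensor 0.
    for batch in minibatches do
        total <- total + loss w batch
        let gw = dsharp.grad (fun w -> loss w batch) w
        w <- w - lr * gw          // lr must remain differentiable through this update
    total

// Hypergradient of the 20-minibatch training loss w.r.t. the learning rate
let lrGrad = dsharp.grad train (dsharp.tensor 0.001)
printfn "d(loss)/d(lr) = %A" lrGrad
```

If that tape is cut anywhere inside the loop, a reported gradient of exactly zero is what I’d expect to see.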
Top GitHub Comments
Hi @dsyme, it is a good indication that these are working! I expect to look at the results closely in the coming days.
In the meantime, I would like to point you and @awf to the paper I published on exactly this subject at ICLR: https://arxiv.org/abs/1703.04782. In the paper we address many of the comments you’ve been raising. It took a lot of effort to explain this technique to the reviewers and the community at ICLR, so the paper is the best summary of everything around this subject so far. I hope you and @awf can have a careful look at it (if you have a bit of time, of course!).
I think the two most important comments are:
- We should implement the online variety as well. There are also lots of research directions that follow from this, and some student projects currently ongoing on my side.
As a note: the paper initially started as a result of a few quick experiments we did with DiffSharp 0.7 at the time. 😃
@dsyme I wouldn’t care too much about a small difference between these two curves beyond 2k iterations. I think in this example the choice of 0.001 was already very close to optimal. I would, for example, look at the adaptive/non-adaptive behavior for other learning rates such as 0.1, 0.01, 0.0001, 0.00001, etc., and see whether the adaptive versions do a good enough job of bringing the learning rate towards the optimal value and the loss towards the optimum. If they do, it’s a winning result for some contexts, because it shows you can adapt the learning rate on the fly to do much better than the non-adaptive algorithm, without any need for a costly grid search. Depending on the setting, minute differences are not always important, as there is never a magic method that is guaranteed to give you “lower than everything else everywhere”. Optimization is a huge field, and this thing you’re expecting is kind of a holy grail.
If you’re in a setting where you really want to make sure you have the absolute best loss and minute differences are important, you would definitely need the costly grid search, Bayesian optimization or other stuff. But this adaptive algorithm can even help with that because it can give you an indication of where to start your costly searches from, so you wouldn’t need to cover a huge costly grid.
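For reference, the core of the online/adaptive rule from the paper (SGD with hypergradient descent) is small enough to sketch. This is only an illustration with hypothetical placeholder names (`loss`, `w0`), assuming `dsharp.grad`, `dsharp.zerosLike` and elementwise `Tensor` arithmetic; the key identity is that the derivative of the current loss with respect to the learning rate used at the previous step is the negative dot product of consecutive gradients, so the learning rate can be nudged on the fly:

```fsharp
open DiffSharp

// SGD with hypergradient-descent learning-rate adaptation (SGD-HD), sketched
// for a generic scalar-valued loss. `loss` and `w0` are placeholders.
let sgdHD (loss: Tensor -> Tensor) (w0: Tensor) (alpha0: float) (beta: float) (steps: int) =
    let betaT = dsharp.tensor beta
    let mutable w = w0
    let mutable alpha = dsharp.tensor alpha0
    let mutable prevGrad = dsharp.zerosLike(w0)
    for _ in 1 .. steps do
        let g = dsharp.grad loss w
        // Hypergradient: d(loss_t)/d(alpha) = -(g . prevGrad); step alpha against it
        alpha <- alpha + betaT * (g * prevGrad).sum()
        w <- w - alpha * g
        prevGrad <- g
    w, alpha
```

The same hypergradient term can be bolted onto other update rules (the paper does this for SGD with Nesterov momentum and Adam), at the cost of a single extra scalar hyper-learning-rate beta.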
A general note: these two curves are just two individual realizations of running these two algorithms. I would definitely be interested in looking at this too: for each curve (orange and blue) I would run exactly the same experiment a handful of times (say ten times) with different random number seeds, and plot the mean and standard deviation of these curves to get a better picture of the expected behaviors. Looking at single realizations of stochastic optimization results can be misleading. (I actually want to add the shaded-region plotting thing to the diffsharp pyplot wrapper for this purpose 😃 )
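A rough sketch of that aggregation with plain `Tensor` operations (the per-seed `runExperiment` plumbing below is hypothetical, and the shaded-region plotting itself isn’t shown):

```fsharp
open DiffSharp

// Aggregate per-seed loss curves into a mean and standard deviation per iteration.
// `runExperiment : int -> Tensor` is a hypothetical function that runs one
// training realization with the given seed and returns a 1-D tensor of losses.
let meanAndStd (runExperiment: int -> Tensor) (seeds: int list) =
    let curves = dsharp.stack([ for s in seeds -> runExperiment s ])   // shape [nSeeds; nIters]
    let mean = curves.mean(0)
    let std = ((curves - mean) * (curves - mean)).mean(0).sqrt()
    mean, std

// e.g. let mean, std = meanAndStd runExperiment [ 0 .. 9 ]
// then plot `mean` with a shaded band of ± `std` around it.
```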