Trialling hyper-parameter optimization
@gbaydin This is a question about something I thought might “just work”. Don’t spend time on it; I’m just curious whether there’s something obvious I’m doing wrong.
I made a tentative attempt at hyper-parameter optimization of the learning rate for the first 20 minibatches of the GAN sample:
https://github.com/DiffSharp/DiffSharp/compare/examples/vae-gan-hypopt?expand=1
gan.fsx: https://github.com/dsyme/DiffSharp/blob/ef0bcd04575a67636b5557dcc953c4ab8e287598/examples/gan.fsx
The aim is simply to work out what the optimal training learning rate would be if we’re only going to run training on precisely those same 20 minibatches. However, my attempt doesn’t work because the derivative of my `train` function is always zero according to the optimizer, e.g. my `printfn` addition to the SGD optimizer gives this:

f = tensor(16.0946), g = tensor(0.)

Here `f` is the sum of the generator’s losses for the first 20 minibatches (the result of `train`) and `g` is the gradient of the `train` function as reported to SGD. At first I thought this might have been due to `noDiff` erasing all derivatives. However, switching to a `stripDiff` that just takes the primal didn’t change things.
Anyway, if you can spot anything simple I’m doing wrong on a quick glance, it would be instructive for me.
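For context, the shape of the computation I’m trying to differentiate is roughly as follows. This is just a minimal sketch with a hypothetical stand-in model and loss rather than the real gan.fsx code; it assumes DiffSharp’s `dsharp.grad`, `dsharp.randn` and `Tensor` arithmetic, and that nested `dsharp.grad` calls propagate derivatives as intended. The point is that `train` takes the learning rate as a tensor and the weight updates have to stay on the derivative tape:

```fsharp
open DiffSharp

// Hypothetical stand-in for the GAN generator: a linear model with a
// squared-error loss over 20 fixed random minibatches.
let minibatches = [ for _ in 1 .. 20 -> dsharp.randn([8; 4]), dsharp.randn([8; 1]) ]

let loss (w: Tensor) ((x, y): Tensor * Tensor) =
    let d = x.matmul(w) - y
    (d * d).mean()

// train : Tensor -> Tensor
// Maps a candidate learning rate to the summed loss over the 20 minibatches.
// For d(train)/d(lr) to be non-zero, the updated weights must stay on the
// derivative tape: detaching them (noDiff / taking primals) after each update
// cuts the only path from lr to the later losses.
let train (lr: Tensor) =
    let mutable w = dsharp.randn([4; 1])
    let mutable total = dsharp.tensor 0.
    for batch in minibatches do
        total <- total + loss w batch
        let gw = dsharp.grad (fun w -> loss w batch) w
        w <- w - lr * gw          // lr must remain differentiable through this update
    total

// Hypergradient of the 20-minibatch training loss w.r.t. the learning rate
let lrGrad = dsharp.grad train (dsharp.tensor 0.001)
printfn "d(loss)/d(lr) = %A" lrGrad
```

If that tape is cut anywhere inside the loop, a reported gradient of exactly zero is what I’d expect to see.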
Top GitHub Comments
Hi @dsyme, it is a good indication that these are working! I expect to look at the results closely in the coming days.
In the meantime, I would like to point you and @awf to the paper I published on exactly this subject at ICLR: https://arxiv.org/abs/1703.04782. In the paper we address many of the comments you’ve been raising. It took a lot of effort to explain this technique to the reviewers and the community at ICLR, so the paper is the best summary of everything around this subject so far. I hope you and @awf can have a careful look at it (if you have a bit of time, of course!).
I think the two most important comments are:
- We should implement the online variety as well. There are also lots of research directions that follow from this, and some student projects currently ongoing on my side.
As a note: the paper initially started as a result of a few quick experiments we did with DiffSharp 0.7 at the time. 😃
@dsyme I wouldn’t care too much about a small difference between these two curves beyond 2k iterations. I think in this example the choice of 0.001 was already very close to optimal. I would, for example, look at the adaptive/non-adaptive behavior for other learning rates such as 0.1, 0.01, 0.0001, 0.00001, etc., and see whether the adaptive versions do a good enough job of bringing the learning rate towards the optimal value and the loss towards the optimum. If they do, it’s a winning result for some contexts, because it shows you can adapt the learning rate on the fly to do much better than the non-adaptive algorithm, without any need for a costly grid search. Depending on the setting, minute differences are not always important, as there is never a magic method that is guaranteed to give you “lower than everything else everywhere”. Optimization is a huge field, and this thing you’re expecting is kind of a holy grail.
If you’re in a setting where you really want to make sure you have the absolute best loss and minute differences are important, you would definitely need the costly grid search, Bayesian optimization or other stuff. But this adaptive algorithm can even help with that because it can give you an indication of where to start your costly searches from, so you wouldn’t need to cover a huge costly grid.
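For reference, the core of the online/adaptive rule from the paper (SGD with hypergradient descent) is small enough to sketch. This is only an illustration with hypothetical placeholder names (`loss`, `w0`), assuming `dsharp.grad`, `dsharp.zerosLike` and elementwise `Tensor` arithmetic; the key identity is that the derivative of the current loss with respect to the learning rate used at the previous step is the negative dot product of consecutive gradients, so the learning rate can be nudged on the fly:

```fsharp
open DiffSharp

// SGD with hypergradient-descent learning-rate adaptation (SGD-HD), sketched
// for a generic scalar-valued loss. `loss` and `w0` are placeholders.
let sgdHD (loss: Tensor -> Tensor) (w0: Tensor) (alpha0: float) (beta: float) (steps: int) =
    let betaT = dsharp.tensor beta
    let mutable w = w0
    let mutable alpha = dsharp.tensor alpha0
    let mutable prevGrad = dsharp.zerosLike(w0)
    for _ in 1 .. steps do
        let g = dsharp.grad loss w
        // Hypergradient: d(loss_t)/d(alpha) = -(g . prevGrad); step alpha against it
        alpha <- alpha + betaT * (g * prevGrad).sum()
        w <- w - alpha * g
        prevGrad <- g
    w, alpha
```

The same hypergradient term can be bolted onto other update rules (the paper does this for SGD with Nesterov momentum and Adam), at the cost of a single extra scalar hyper-learning-rate beta.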
A general note: these two curves are just two individual realizations of running these two algorithms. I would definitely be interested in looking at this too: for each curve (orange and blue) I would run exactly the same experiment a handful of times (say ten times) with different random number seeds, and plot the mean and standard deviation of these curves to get a better picture of the expected behaviors. Looking at single realizations of stochastic optimization results can be misleading. (I actually want to add the shaded-region plotting thing to the diffsharp pyplot wrapper for this purpose 😃 )
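A rough sketch of that aggregation with plain `Tensor` operations (the per-seed `runExperiment` plumbing below is hypothetical, and the shaded-region plotting itself isn’t shown):

```fsharp
open DiffSharp

// Aggregate per-seed loss curves into a mean and standard deviation per iteration.
// `runExperiment : int -> Tensor` is a hypothetical function that runs one
// training realization with the given seed and returns a 1-D tensor of losses.
let meanAndStd (runExperiment: int -> Tensor) (seeds: int list) =
    let curves = dsharp.stack([ for s in seeds -> runExperiment s ])   // shape [nSeeds; nIters]
    let mean = curves.mean(0)
    let std = ((curves - mean) * (curves - mean)).mean(0).sqrt()
    mean, std

// e.g. let mean, std = meanAndStd runExperiment [ 0 .. 9 ]
// then plot `mean` with a shaded band of ± `std` around it.
```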