Restoring w/ Adam?

I’m using tf.train.Saver() to save and restore a model. This works, except that when training resumes, the model's accuracy at first continues where it left off, then suddenly drops completely, as if training had started from scratch. I checked everything in my code and I don’t know what else could be responsible. I am using Adam, and I wonder whether its variables are not being restored because of some nuance of Sonnet. I can’t figure out why training accuracy drops after a few batches with the restored model.
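
One way to test the Adam hypothesis directly is to look at which variable names the checkpoint actually contains. The sketch below is illustrative only. It assumes the TF 1.x naming convention in which Adam's per-variable slots end in `/Adam` and `/Adam_1` and its two scalar accumulators are named `beta1_power` and `beta2_power`. The helper `split_adam_variables` and the example names are hypothetical, not taken from the issue.

```python
def split_adam_variables(checkpoint_names):
    """Separate Adam bookkeeping variables from model variables by name.

    Assumes TF 1.x style names: slots end in "Adam" / "Adam_1", and the
    scalar accumulators are "beta1_power" / "beta2_power".
    """
    adam_suffixes = ("Adam", "Adam_1", "beta1_power", "beta2_power")
    adam, model = [], []
    for name in checkpoint_names:
        last_component = name.rsplit("/", 1)[-1]
        if last_component in adam_suffixes:
            adam.append(name)
        else:
            model.append(name)
    return adam, model


# Hypothetical names, shaped like the first element of each
# (name, shape) pair that tf.train.list_variables(path) returns.
example_names = [
    "mlp/linear_0/w",
    "mlp/linear_0/w/Adam",
    "mlp/linear_0/w/Adam_1",
    "beta1_power",
    "beta2_power",
]
adam_names, model_names = split_adam_variables(example_names)
```

With a real checkpoint, the names could be obtained with `[name for name, _ in tf.train.list_variables(checkpoint_path)]`. If `adam_names` comes back empty, the optimizer state was probably never saved; if it is populated, the cause is more likely elsewhere.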

malcolmreynolds commented, Dec 3, 2018

@slerman12 The example I tested was just a toy example, it wasn’t hooked up to any real data, I was just seeing whether the Adam variables were restored or not.

If it works with a single-layer MLP but not with multiple layers, that is very strange - what is vocab_size set to? I’m wondering whether, if 256 << vocab_size, the bottleneck somehow creates instability in general. You could try making multiple Linears explicitly (with ReLUs between them) - that should be exactly equivalent to the MLP, but without knowing what is wrong in your setup it might give a useful datapoint.

I have a few other ideas but I should clarify they are all guesses.

Is there something in the data input which could be getting restored incorrectly, e.g. do you have a curriculum over difficulty of samples which could be being reset to some initial difficulty, thus ruining training?
Similarly, are you doing RL and learning from a replay buffer which goes back to being empty on restoration?
Does this drop in performance ever occur without a save/restore taking place? In my experience, async optimization can produce sudden drops in performance like this.
Are you using any custom getters?

If you can provide a fully self-contained example I might be able to offer further advice. However, we definitely haven’t seen anything like this internally: there are lots of people using something analogous to your setup (Sonnet layers, Adam or other optimizers with auxiliary variables, tf.train.Saver), and we haven’t had any reports of anything like this. Using the Google compute infrastructure requires checkpointing, as individual machines may be taken out of service during training, so this would have become an obvious issue by now.

slerman12 commented, Dec 3, 2018

Aha! My data was not being shuffled upon restore and the model would begin to overfit to highly correlated batches (more MLP layers allowed it to overfit more) - I should have realized. Thank you for your help.
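
For readers who hit the same problem, here is a minimal sketch of one way to keep shuffling intact across a restore. It is not the fix used in the issue, which is not shown. It assumes that the only state available after restoring is the global step, and it derives each epoch's shuffled order from a seed and the epoch number, so a resumed run sees the same shuffled batches an uninterrupted run would. The names `epoch_order`, `batches`, and `base_seed` are hypothetical.

```python
import random


def epoch_order(num_examples, epoch, base_seed=0):
    """Return a shuffled index order that is reproducible for a given epoch."""
    # Mix the epoch into the seed so each epoch has its own permutation.
    rng = random.Random(base_seed * 1_000_003 + epoch)
    order = list(range(num_examples))
    rng.shuffle(order)
    return order


def batches(num_examples, batch_size, start_step, num_steps, base_seed=0):
    """Yield (step, indices), starting at start_step (e.g. a restored global step)."""
    steps_per_epoch = num_examples // batch_size
    for step in range(start_step, start_step + num_steps):
        epoch, offset = divmod(step, steps_per_epoch)
        # Recomputed every step for simplicity; a real pipeline would cache it.
        order = epoch_order(num_examples, epoch, base_seed)
        yield step, order[offset * batch_size:(offset + 1) * batch_size]


# Resuming at step 3 continues the same shuffled stream as an uninterrupted run.
uninterrupted = list(batches(10, 2, start_step=0, num_steps=8))
resumed = list(batches(10, 2, start_step=0, num_steps=3)) + list(
    batches(10, 2, start_step=3, num_steps=5)
)
```

The design choice here is to make the data order a pure function of the step counter, so nothing beyond what the checkpoint already stores needs to be saved. An input pipeline with its own shuffle buffer would need a different approach.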