Restoring w/ Adam?
I’m using tf.train.Saver() to save and restore a model. This works, except that when training resumes, the accuracy of the model at first continues where it left off, then suddenly drops as if training had started from scratch. I’ve checked everything in my code and can’t see what else could be responsible. I’m using Adam, and I wonder whether its variables are not being restored because of some nuance of Sonnet. I can’t figure out why training accuracy drops after a few batches with the restored model.
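For context on the suspicion above: Adam keeps two running-moment accumulators (m and v) plus a step count for every parameter, and in TF1 these are ordinary variables, so a tf.train.Saver() constructed with no var_list should include them in the checkpoint. Here is a minimal pure-Python sketch of the Adam update (a generic textbook implementation, not the code from this issue), illustrating what state would be lost if only the weights were restored:

```python
import math

def adam_step(params, grads, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update. m and v are per-parameter running moments;
    t is the global step used for bias correction. All of this is
    optimizer state that must survive a save/restore cycle."""
    t += 1
    new_params = []
    for i, (p, g) in enumerate(zip(params, grads)):
        m[i] = b1 * m[i] + (1 - b1) * g          # first moment (mean of grads)
        v[i] = b2 * v[i] + (1 - b2) * g * g      # second moment (mean of grad^2)
        m_hat = m[i] / (1 - b1 ** t)             # bias-corrected estimates
        v_hat = v[i] / (1 - b2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
    return new_params, m, v, t

# Minimise f(x) = x^2 from x = 5; the gradient is 2x.
params, m, v, t = [5.0], [0.0], [0.0], 0
for _ in range(100):
    grads = [2 * p for p in params]
    params, m, v, t = adam_step(params, grads, m, v, t, lr=0.1)

# After training, m, v and t are all nonzero. Zeroing them on restore
# resets the bias correction and the adaptive step size, which can
# visibly perturb the first updates after resuming.
```

If the slot variables really were missing from the checkpoint, the symptom would typically be a transient disturbance right after restore rather than a delayed collapse, which is one reason to look elsewhere too.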
Issue Analytics
- State:
- Created 5 years ago
- Comments: 9
@slerman12 The example I tested was just a toy example; it wasn’t hooked up to any real data. I was just checking whether the Adam variables were restored or not.
If it works with a single-layer MLP but not with multiple layers, that is very strange. What is vocab_size set to? I’m wondering whether, if 256 << vocab_size, the bottleneck somehow creates instability in general. You could try making multiple Linears explicitly (with ReLUs between them); that should be exactly equivalent to the MLP, and without knowing what is wrong in your setup it might give a useful data point.
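The equivalence claimed above can be illustrated outside Sonnet: an MLP’s forward pass is exactly a chain of linear layers with ReLUs between them. A plain-NumPy sketch, where the layer sizes are hypothetical stand-ins for a 256-unit bottleneck feeding a vocab_size output:

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(x, w, b):
    return x @ w + b

def relu(x):
    return np.maximum(x, 0.0)

# Hypothetical shapes: input 256, hidden 256, output vocab_size = 1000.
sizes = [256, 256, 1000]
weights = [rng.normal(size=(a, b)) * 0.01 for a, b in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]

def mlp(x):
    # Generic MLP: ReLU between layers, no activation on the final output.
    h = x
    for i, (w, b) in enumerate(zip(weights, biases)):
        h = linear(h, w, b)
        if i < len(weights) - 1:
            h = relu(h)
    return h

def explicit(x):
    # The same network written as explicit Linears with a ReLU between them.
    h = relu(linear(x, weights[0], biases[0]))
    return linear(h, weights[1], biases[1])

x = rng.normal(size=(4, 256))
# mlp(x) and explicit(x) are identical, so swapping one form for the
# other in a debugging session should not change behaviour at all.
```

The point of the swap is diagnostic: if the explicit version behaves differently from the MLP module, the problem is in how the module’s variables are created or restored, not in the architecture itself.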
I have a few other ideas, but I should clarify that they are all guesses.
If you can provide a full, self-contained example I might be able to offer further advice. However, we definitely haven’t seen anything like this internally, and there are lots of people using something analogous to your setup (Sonnet layers, Adam or other optimizers with auxiliary variables, tf.train.Saver), and we haven’t had any reports of anything like this. Using the Google compute infrastructure requires checkpointing, since individual machines may be taken out of service during training, so a problem like this would have become obvious by now.
Aha! My data was not being shuffled upon restore, so the model began to overfit to highly correlated batches (more MLP layers let it overfit harder). I should have realized. Thank you for your help.
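For readers hitting the same symptom: the fix is to draw a fresh permutation of the dataset every epoch, including the first epoch after a restore. A plain-Python sketch (the name epoch_batches is illustrative, not from the original code):

```python
import random

def epoch_batches(dataset, batch_size, rng):
    """Yield batches over a freshly shuffled copy of the dataset.
    Reshuffling every epoch keeps consecutive batches decorrelated,
    which matters especially after resuming from a checkpoint."""
    order = list(range(len(dataset)))
    rng.shuffle(order)  # new permutation on every call
    for i in range(0, len(order), batch_size):
        yield [dataset[j] for j in order[i:i + batch_size]]

data = list(range(10))
# Seed from entropy rather than a fixed constant, so a restored job
# does not replay the exact same batch order it already trained on.
rng = random.Random()
epoch1 = [x for batch in epoch_batches(data, 3, rng) for x in batch]
epoch2 = [x for batch in epoch_batches(data, 3, rng) for x in batch]
```

If the restored job instead iterates the data in a fixed (or repeated) order, the model sees runs of highly correlated examples and can overfit to them, producing exactly the sudden accuracy drop described in the question.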