Restoring w/ Adam?

I’m using tf.train.Saver() to save and restore a model. This works, except that when training resumes, the accuracy of the model at first continues where it left off, but then suddenly drops completely, as if starting from scratch. I checked everything in my code and I don’t know what else could be responsible for it. I am using Adam, and I wonder whether its variables are not being restored because of some nuance of Sonnet. I can’t figure out why training accuracy drops after a few batches with the restored model.
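To make the concern about Adam concrete, here is a minimal pure-Python sketch (no TensorFlow or Sonnet involved) of why the optimizer's auxiliary variables matter on restore. The scalar objective, the dictionary layout, and every name below are assumptions chosen for illustration, not the code from the issue.

```python
import copy
import math


def adam_step(state, grad, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """Apply one Adam update to `state` in place and return it."""
    state["t"] += 1
    state["m"] = b1 * state["m"] + (1 - b1) * grad
    state["v"] = b2 * state["v"] + (1 - b2) * grad * grad
    m_hat = state["m"] / (1 - b1 ** state["t"])  # bias-corrected first moment
    v_hat = state["v"] / (1 - b2 ** state["t"])  # bias-corrected second moment
    state["w"] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return state


def grad(w):
    """Gradient of the toy objective (w - 3)^2."""
    return 2.0 * (w - 3.0)


def train(state, steps):
    for _ in range(steps):
        adam_step(state, grad(state["w"]))
    return state


fresh = {"w": 0.0, "m": 0.0, "v": 0.0, "t": 0}

# Reference run: 20 steps with no interruption.
uninterrupted = train(copy.deepcopy(fresh), 20)

# Checkpoint after 10 steps, then resume with the full optimizer state.
checkpoint = train(copy.deepcopy(fresh), 10)
full_restore = train(copy.deepcopy(checkpoint), 10)

# Resume with the weight only: the moments and step counter start over.
weights_only = {"w": checkpoint["w"], "m": 0.0, "v": 0.0, "t": 0}
partial_restore = train(weights_only, 10)
```

In this toy setting, `full_restore` follows the uninterrupted trajectory exactly, while `partial_restore` diverges from it. That is the kind of discrepancy one would look for when checking whether the optimizer variables made it into the checkpoint.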

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 9

Top GitHub Comments

1 reaction
malcolmreynolds commented, Dec 3, 2018

@slerman12 The example I tested was just a toy example, it wasn’t hooked up to any real data, I was just seeing whether the Adam variables were restored or not.

If it works with a single-layer MLP but not with multiple layers, that is very strange. What is vocab_size set to? I’m wondering whether, if 256 << vocab_size, the bottleneck somehow creates instability in general. You could try making multiple Linears explicitly (with ReLUs between them). That should be exactly equivalent to the MLP, and even without knowing what is wrong in your setup it might give a useful data point.
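The equivalence between an MLP and explicitly chained Linears can be sketched in plain Python. This does not use Sonnet; `linear`, `relu`, and `mlp` are hypothetical stand-ins, and the sketch assumes the MLP applies ReLU between layers but not after the final one.

```python
def linear(x, weights, bias):
    """Compute x @ weights + bias for one input vector (weights is [in][out])."""
    return [sum(xi * w for xi, w in zip(x, column)) + b
            for column, b in zip(zip(*weights), bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def mlp(x, layers):
    """Apply every (weights, bias) layer, with ReLU between layers only."""
    for i, (weights, bias) in enumerate(layers):
        x = linear(x, weights, bias)
        if i < len(layers) - 1:  # no activation after the final layer
            x = relu(x)
    return x


layer1 = ([[1.0, -1.0], [0.5, 2.0]], [0.0, -1.0])  # 2 inputs -> 2 hidden units
layer2 = ([[1.0], [-3.0]], [0.5])                   # 2 hidden units -> 1 output
x = [1.0, 2.0]

# The same computation expressed two ways.
via_mlp = mlp(x, [layer1, layer2])
hidden = relu(linear(x, *layer1))
via_explicit = linear(hidden, *layer2)
```

Both paths produce the same output here, so any behavioural difference between the two forms in a real model would point at something other than the layer arithmetic.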

I have a few other ideas but I should clarify they are all guesses.

  • Is there something in the data input that could be getting restored incorrectly? For example, do you have a curriculum over sample difficulty that might be reset to some initial difficulty, thus ruining training?
  • Similarly, are you doing RL and learning from a replay buffer that goes back to being empty on restoration?
  • Does this drop in performance ever occur without a save/restore taking place? In my experience, async optimization can produce sudden drops in performance like this.
  • Are you using any custom getters?
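The first two guesses both come down to input-side state that lives outside the model checkpoint. As a rough sketch of one way to persist it, the example below saves a replay buffer and a curriculum difficulty to a JSON file next to the checkpoint. `ReplayBuffer`, `save_pipeline_state`, `load_pipeline_state`, and the file name are all hypothetical names invented for this illustration.

```python
import collections
import json
import os
import tempfile


class ReplayBuffer:
    """Fixed-capacity FIFO buffer whose contents can be saved and restored."""

    def __init__(self, capacity):
        self.items = collections.deque(maxlen=capacity)

    def add(self, item):
        self.items.append(item)

    def state(self):
        return {"capacity": self.items.maxlen, "items": list(self.items)}

    @classmethod
    def from_state(cls, state):
        buf = cls(state["capacity"])
        buf.items.extend(state["items"])
        return buf


def save_pipeline_state(path, buffer, difficulty):
    """Write the input-side state that a model checkpoint does not cover."""
    with open(path, "w") as f:
        json.dump({"buffer": buffer.state(), "difficulty": difficulty}, f)


def load_pipeline_state(path):
    with open(path) as f:
        state = json.load(f)
    return ReplayBuffer.from_state(state["buffer"]), state["difficulty"]


buffer = ReplayBuffer(capacity=3)
for transition in [[0, 1], [1, 2], [2, 3], [3, 4]]:
    buffer.add(transition)  # the oldest entry is evicted once capacity is hit

path = os.path.join(tempfile.mkdtemp(), "pipeline_state.json")
save_pipeline_state(path, buffer, difficulty=5)
restored_buffer, restored_difficulty = load_pipeline_state(path)
```

After the round trip, the restored buffer holds the same three most recent transitions and the difficulty is unchanged, so training would not resume from an empty buffer or from the initial curriculum level.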

If you can provide a full, self-contained example, I might be able to offer further advice. However, we definitely haven’t seen anything like this internally. Lots of people use something analogous to your setup (Sonnet layers, Adam or other optimizers with auxiliary variables, tf.train.Saver), and we haven’t had any reports of anything like this. Using the Google compute infrastructure requires checkpointing, as individual machines may be taken out of service during training, so this would have become an obvious issue by now.

0 reactions
slerman12 commented, Dec 3, 2018

Aha! My data was not being shuffled upon restore and the model would begin to overfit to highly correlated batches (more MLP layers allowed it to overfit more) - I should have realized. Thank you for your help.
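One way to avoid this class of bug is to make the shuffle order a pure function of a seed and the epoch number, so that a resumed run sees the same shuffled batches an uninterrupted run would have seen. The sketch below is an assumption about how that could look, not the fix actually used in the issue; `epoch_order` and `batches` are hypothetical helpers.

```python
import itertools
import random


def epoch_order(num_examples, seed, epoch):
    """Return a shuffled index order that depends only on (seed, epoch)."""
    order = list(range(num_examples))
    random.Random(seed * 1000003 + epoch).shuffle(order)
    return order


def batches(num_examples, batch_size, seed, start_step):
    """Yield (step, indices) forever, starting at the global step `start_step`."""
    steps_per_epoch = num_examples // batch_size
    step = start_step
    while True:
        epoch, offset = divmod(step, steps_per_epoch)
        # Recomputed on every step for clarity; a real pipeline would cache it.
        order = epoch_order(num_examples, seed, epoch)
        yield step, order[offset * batch_size:(offset + 1) * batch_size]
        step += 1


# Eight batches from an uninterrupted run over 10 examples, batch size 2.
full = [idx for _, idx in
        itertools.islice(batches(10, 2, seed=0, start_step=0), 8)]

# A run restored at global step 5 (the step number comes from the checkpoint).
resumed = [idx for _, idx in
           itertools.islice(batches(10, 2, seed=0, start_step=5), 3)]
```

Here the only thing that has to be stored in the checkpoint is the global step. The resumed run reproduces batches 5 to 7 of the uninterrupted run, and each epoch still visits every example exactly once in shuffled order.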
