When I try to restore weights from checkpoint, following Score2perf's README, checkpoint doesn't have some keys

Ubuntu 18.04, Python 3.7.9, TensorFlow 2.3.1

When I follow https://github.com/magenta/magenta/blob/master/magenta/models/score2perf/README.md, restoring from a checkpoint fails because some keys are missing from it. The problem occurs in the "Training" and "Sampling from the model" parts of the README.

This issue is very similar to https://github.com/magenta/magenta/issues/1647. The score2perf code now imports tf.compat.v1 as tf, so it should not matter that I am using TensorFlow 2.

The training command from the README is shown below.

DATA_DIR=/generated/tfrecords/dir
HPARAMS_SET=score2perf_transformer_base
MODEL=transformer
PROBLEM=score2perf_maestro_language_uncropped_aug
TRAIN_DIR=/training/dir

HPARAMS=\
"label_smoothing=0.0,"\
"max_length=0,"\
"max_target_seq_length=2048"

t2t_trainer \
  --data_dir="${DATA_DIR}" \
  --hparams=${HPARAMS} \
  --hparams_set=${HPARAMS_SET} \
  --model=${MODEL} \
  --output_dir=${TRAIN_DIR} \
  --problem=${PROBLEM} \
  --train_steps=1000000

When I run the training command as the README describes, I get the error below after 1000 epochs of training, at the point where the script tries to load the 1000-epoch checkpoint and run evaluation.

Not found: Key transformer/parallel_0_3/transformer/transformer/body/decoder/layer_0/self_attention/multihead_attention/k/kernel not found in checkpoint
[[node save/RestoreV2_1 (defined at /.pyenv/versions/3.7.9/envs/tensor2tensor/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py:629) ]]
tensorflow.python.framework.errors_impl.NotFoundError: Key _CHECKPOINTABLE_OBJECT_GRAPH not found in checkpoint
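
One way to check whether the key named in the error is really absent is to list the variable names stored in the checkpoint. The sketch below assumes that TensorFlow is importable from python3 in the same environment and uses tf.train.list_variables, which accepts either a checkpoint directory or a checkpoint prefix. The helper name list_checkpoint_keys is hypothetical and is not part of magenta or tensor2tensor; TRAIN_DIR is the directory from the training command above.

```shell
# Hypothetical helper (not part of magenta or tensor2tensor): prints the name
# and shape of every variable stored in the newest checkpoint under a directory.
list_checkpoint_keys() {
  python3 - "$1" <<'EOF'
import sys
import tensorflow as tf

# With a directory argument, the most recent checkpoint in it is inspected.
for name, shape in tf.train.list_variables(sys.argv[1]):
    print(name, shape)
EOF
}

# Example: search for the layer named in the error message.
# list_checkpoint_keys "${TRAIN_DIR}" | grep "layer_0/self_attention/multihead_attention/k/kernel"
```

If the layer shows up under a different prefix than the one in the error message, the mismatch may lie in how the evaluation graph names its variables rather than in the saved weights, although that is only one possible reading.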

However, when I run the training command again, it surprisingly succeeds in loading the checkpoint, resumes training from epoch 1000, and saves the epoch-2000 weights. But when it then loads the epoch-2000 checkpoint and tries to run evaluation, it fails again.

Inference ("Sampling from the model") simply fails.

Could anyone help? Thanks in advance.

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 5

Top GitHub Comments

3 reactions
Chinatown1444 commented, Mar 2, 2022

Hi. I managed to fix the problem for myself by cloning the latest tensor2tensor master branch and installing it by using pip install -e . in the local tensor2tensor repository.
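
The steps described in this comment can be sketched roughly as follows. The repository URL is an assumption (the comment does not give one), and the commands are meant to be run inside the same Python environment that is used for training.

```shell
# Assumed URL of the tensor2tensor repository; the comment only says
# "the latest tensor2tensor master branch".
git clone https://github.com/tensorflow/tensor2tensor.git
cd tensor2tensor

# Editable install: the environment then imports this checkout
# instead of a previously installed release.
pip install -e .
```

Afterwards, pip show tensor2tensor should report where the package is installed, which is a quick way to confirm that the local checkout is the one in use.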

1 reaction
JohannesKlueh commented, Feb 13, 2022

Hello. I also have the same problem. Are there any solutions yet?
