TFBertForMaskedLM won't reload from saved checkpoint, shape mismatch issue
Environment info
- transformers version: 4.5.1-4.7
- Platform: Debian GNU/Linux 10 (buster)
- Python version: 3.9.2
- PyTorch version (GPU?): N/A
- Tensorflow version (GPU?): 2.5.0
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
Who can help
@Rocketknight1, @LysandreJik, @sgugger
Information
Model I am using: TFBertForMaskedLM
The problem arises when using:
- [ ] the official example scripts: (give details below)
- [x] my own modified scripts: (give details below)
The task I am working on is:
- [ ] an official GLUE/SQuAD task: (give the name)
- [x] my own task or dataset: (give details below)
I believe this issue also affects the official TFTrainer implementation as the checkpoint restore snippet was adapted from it.
To reproduce
Steps to reproduce the behavior:
- Generate a masked batch
- Initialize a TF model and assign a CheckpointManager
- Save a model checkpoint
- Initialize a new TF model and assign a CheckpointManager
- Restore from the checkpoint and run the restored model on the batch (this is where the error occurs)
import numpy as np
from transformers import AutoTokenizer, TFAutoModelForMaskedLM, AutoConfig, TFAutoModelForCausalLM
import tensorflow as tf
random_sentences = ["You'll see the rainbow bridge after it rains cats and dogs.",
"They looked up at the sky and saw a million stars.",
"The bullet pierced the window shattering it before missing Danny's head by mere millimeters.",
"He was willing to find the depths of the rabbit hole in order to be with her."]
tok = AutoTokenizer.from_pretrained('bert-base-uncased')
input_ids = tok.batch_encode_plus(random_sentences,return_tensors='np',padding=True)['input_ids']
# Create labels: -100 everywhere except at ~20% randomly chosen non-padding positions
labels = np.ones_like(input_ids) * -100
mask = (np.random.uniform(size=input_ids.shape) <= 0.2) & (input_ids != 0)
labels[mask] = tok.mask_token_id
batch= {'input_ids':tf.convert_to_tensor(input_ids),
'labels':tf.convert_to_tensor(labels)}
"""## Run model and save checkpoint"""
model = TFAutoModelForMaskedLM.from_pretrained('bert-base-uncased')
checkpoint = tf.train.Checkpoint(model=model)
model.ckpt_manager = tf.train.CheckpointManager(checkpoint, './', max_to_keep=1)
out = model(**batch)
print(out.loss.numpy())
model.ckpt_manager.save()
"""## Re-Initialize from config alone an load existing checkpoint"""
cfg = AutoConfig.from_pretrained('bert-base-uncased')
model2 = TFAutoModelForMaskedLM.from_config(cfg)
checkpoint2 = tf.train.Checkpoint(model=model2)
model2.ckpt_manager = tf.train.CheckpointManager(checkpoint2, './', max_to_keep=1)
latest_ckpt = tf.train.latest_checkpoint('./')
status = checkpoint2.restore(latest_ckpt)
status.assert_existing_objects_matched()
out = model2(**batch)
print(out.loss.numpy())
Expected behavior
The model should fully restore from the checkpoint and produce the same loss as before saving.
Current behavior (error output)
ValueError Traceback (most recent call last)
<ipython-input-12-5ec2de12ee44> in <module>()
----> 1 out = model2(**batch)
2 out.loss
19 frames
/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py in set_shape(self, shape)
1238 raise ValueError(
1239 "Tensor's shape %s is not compatible with supplied shape %s" %
-> 1240 (self.shape, shape))
1241
1242 # Methods not supported / implemented for Eager Tensors.
ValueError: Tensor's shape (512, 768) is not compatible with supplied shape [2, 768]
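For what it's worth, the two shapes in the error look like BERT's position embeddings (max_position_embeddings=512, hidden_size=768) being restored into the token type embedding slot (type_vocab_size=2), which suggests the checkpoint variables are being matched to the wrong model variables during the deferred restore. A quick way to confirm what was actually written to the checkpoint is to list its variables (diagnostic sketch, not part of the original repro):
import tensorflow as tf

latest_ckpt = tf.train.latest_checkpoint('./')
# Prints every (variable name, shape) pair stored in the checkpoint;
# the position embeddings should appear with shape [512, 768]
for name, shape in tf.train.list_variables(latest_ckpt):
    print(name, shape)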
Link to colab
https://colab.research.google.com/drive/12pwo4WSueOT523hh1INw5J_SLpkK0IgB?usp=sharing
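A possible workaround, sketched below, is to serialize with the library's own save_pretrained / from_pretrained instead of tf.train.Checkpoint. This sidesteps the checkpoint-matching problem rather than fixing it, and reuses model and batch from the repro above; the directory name is just an example.
# Workaround sketch: use the Transformers-native save/load path instead of tf.train.Checkpoint
model.save_pretrained('./hf_ckpt')                     # writes config.json + tf_model.h5
model2 = TFAutoModelForMaskedLM.from_pretrained('./hf_ckpt')
out = model2(**batch)
print(out.loss.numpy())                                # should match the pre-save loss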
Top GitHub Comments
Hey, thank you for that very helpful bit of diagnostic info! That links this with #11202, another issue we have caused by the same underlying problem. This is helpful because I’ll probably need to make some breaking changes to fix that issue, and the fact that it’s causing multiple downstream problems will increase the urgency there.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.