Large differences between T5 weight initialization in TF and torch

  • transformers version: 4.18.0, master branch

Who can help

@patrickvonplaten

I found some significant differences in weight init between the PT and TF implementations of T5.

The embeddings (model.shared):

  • In PT, according to T5PreTrainedModel._init_weights, they are drawn from a normal distribution with std=1.0: module.shared.weight.data.normal_(mean=0.0, std=factor * 1.0)

  • In TF (TFT5Model), the embeddings are created as follows: self.shared = TFSharedEmbeddings(config.vocab_size, config.d_model, name="shared"). Since initializer_range is not provided, the default is used, which is hidden_size**-0.5 (see TFSharedEmbeddings).

This means that in the base model (d=768), the weights in PT are being initialized with stdev=1.0, and in TF they are being initialized with stdev=0.036.
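The two numbers above can be reproduced with a minimal sketch in plain Python. The helper names are hypothetical, and factor=1.0 is an assumption that matches the std=1.0 quoted for PT; nothing here calls into transformers itself.

```python
import random
import statistics

D_MODEL = 768  # T5-base hidden size used in the issue


def tf_default_embedding_std(hidden_size: int) -> float:
    # Default described in the issue when initializer_range is not passed.
    return hidden_size ** -0.5


def pt_embedding_std(factor: float = 1.0) -> float:
    # std argument of normal_() in the quoted PT line; factor=1.0 is assumed.
    return factor * 1.0


tf_std = tf_default_embedding_std(D_MODEL)
pt_std = pt_embedding_std()

# Draw samples to check that the empirical spread matches the nominal std.
rng = random.Random(0)
tf_empirical = statistics.pstdev([rng.gauss(0.0, tf_std) for _ in range(20_000)])
pt_empirical = statistics.pstdev([rng.gauss(0.0, pt_std) for _ in range(20_000)])

print(f"TF embedding std: {tf_std:.4f} (empirical {tf_empirical:.4f})")
print(f"PT embedding std: {pt_std:.4f} (empirical {pt_empirical:.4f})")
print(f"PT/TF ratio:      {pt_std / tf_std:.1f}")
```

The ratio equals sqrt(d_model), so the gap between the two implementations grows with model size.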

The LM head (model.lm_head):

  • In PT, the layer is created as lm_head = nn.Linear(config.d_model, config.vocab_size, bias=False) with no initializer specified, so it falls back to the nn.Linear default: a uniform distribution over [-sqrt(1/d_model), sqrt(1/d_model)] (https://pytorch.org/docs/stable/generated/torch.nn.Linear.html). The weights don’t seem to be initialized in _init_weights either.

  • In TF, the initializer is explicitly provided (TFT5ForConditionalGeneration): lm_head_initializer = tf.keras.initializers.RandomNormal(mean=0, stddev=config.initializer_factor)

So, in the base model, the weights in PT are initialized with a uniform distribution over [-0.036, 0.036], and in TF they are initialized with a random normal with stdev=1.0.
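To compare the two lm_head schemes on the same scale, it helps to convert the uniform bound into a standard deviation (for a uniform distribution over [-a, a], the std is a/sqrt(3)). The following is a rough sketch in plain Python with illustrative variable names; it assumes initializer_factor=1.0 on the TF side, as the stdev=1.0 figure above implies.

```python
import math
import random
import statistics

D_MODEL = 768  # T5-base

# PT: nn.Linear default, uniform over [-sqrt(1/d_model), sqrt(1/d_model)].
pt_bound = math.sqrt(1.0 / D_MODEL)
pt_lm_head_std = pt_bound / math.sqrt(3.0)  # std of U(-a, a) is a / sqrt(3)

# TF: RandomNormal(mean=0, stddev=initializer_factor), factor assumed 1.0.
tf_lm_head_std = 1.0

# Empirical check of the uniform case.
rng = random.Random(0)
pt_sample = [rng.uniform(-pt_bound, pt_bound) for _ in range(20_000)]
pt_lm_head_empirical = statistics.pstdev(pt_sample)

print(f"PT bound:       {pt_bound:.4f}")
print(f"PT lm_head std: {pt_lm_head_std:.4f} (empirical {pt_lm_head_empirical:.4f})")
print(f"TF lm_head std: {tf_lm_head_std:.4f}")
print(f"TF/PT ratio:    {tf_lm_head_std / pt_lm_head_std:.1f}")
```

Expressed as standard deviations, the lm_head gap is even larger than the [-0.036, 0.036] range suggests, because a uniform distribution has a smaller std than its bound.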

I’m not entirely sure what the actual implications are for model training. But at the very least, the lm_head weights will have a huge impact on the initial loss values.
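The effect on the initial loss can be illustrated with a toy simulation. This is only a sketch under strong assumptions: the hidden states are drawn as unit-variance Gaussian noise instead of coming from a real T5 decoder, the vocabulary is shrunk to 200 entries to keep it fast, and all names are hypothetical.

```python
import math
import random

D_MODEL = 768      # T5-base hidden size
VOCAB_SIZE = 200   # toy vocabulary, far smaller than the real one
N_POSITIONS = 8    # number of simulated token positions


def mean_initial_loss(sample_weight, rng):
    """Mean cross-entropy of a random linear head on random hidden states."""
    weights = [[sample_weight(rng) for _ in range(D_MODEL)]
               for _ in range(VOCAB_SIZE)]
    total = 0.0
    for _ in range(N_POSITIONS):
        # Assumption: hidden states are i.i.d. N(0, 1), not real model outputs.
        hidden = [rng.gauss(0.0, 1.0) for _ in range(D_MODEL)]
        logits = [sum(w * h for w, h in zip(row, hidden)) for row in weights]
        target = rng.randrange(VOCAB_SIZE)
        # Numerically stable log-sum-exp for the softmax normalizer.
        top = max(logits)
        log_z = top + math.log(sum(math.exp(x - top) for x in logits))
        total += log_z - logits[target]
    return total / N_POSITIONS


bound = math.sqrt(1.0 / D_MODEL)
loss_uniform = mean_initial_loss(lambda r: r.uniform(-bound, bound), random.Random(0))
loss_normal = mean_initial_loss(lambda r: r.gauss(0.0, 1.0), random.Random(0))

print(f"log(vocab_size):                  {math.log(VOCAB_SIZE):.2f}")
print(f"initial loss, uniform (PT-style): {loss_uniform:.2f}")
print(f"initial loss, normal std=1 (TF):  {loss_normal:.2f}")
```

Under these assumptions the uniform head should start near log(vocab_size), the loss of a uniform guess, while the std=1.0 head produces logits with a much larger spread and therefore a much higher starting loss.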

Based on other transformer models I’ve seen, the “correct” answer seems to be that both weights should be initialized with stdev=1.0. But neither implementation actually does this for both.

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 10 (6 by maintainers)

Top GitHub Comments

2 reactions
craffel commented, Apr 14, 2022
1 reaction
jorgemcgomes commented, May 16, 2022

Please take over the issue, @patrickvonplaten. This got pretty muddy, and I’m not sure what the right approach is here.
