
Initializing attention weights in T5

See original GitHub issue

@patrickvonplaten @patil-suraj @craffel Excuse me if this question has been asked before, but I could not find an answer to it.

In these lines:

        elif isinstance(module, (LongT5Attention, LongT5LocalAttention, LongT5TransientGlobalAttention)):
            # Mesh TensorFlow attention initialization to avoid scaling before softmax
            # See https://github.com/tensorflow/mesh/blob/fa19d69eafc9a482aff0b59ddd96b025c0cb207d/mesh_tensorflow/transformer/attention.py#L136
            d_model = self.config.d_model
            key_value_proj_dim = self.config.d_kv
            n_heads = self.config.num_heads
            module.q.weight.data.normal_(mean=0.0, std=factor * ((d_model * key_value_proj_dim) ** -0.5))
            module.k.weight.data.normal_(mean=0.0, std=factor * (d_model**-0.5))
            module.v.weight.data.normal_(mean=0.0, std=factor * (d_model**-0.5))
            module.o.weight.data.normal_(mean=0.0, std=factor * ((n_heads * key_value_proj_dim) ** -0.5))
            if module.has_relative_attention_bias:
                module.relative_attention_bias.weight.data.normal_(mean=0.0, std=factor * ((d_model) ** -0.5))
                if isinstance(module, LongT5TransientGlobalAttention):
                    module.global_relative_attention_bias.weight.data.normal_(
                        mean=0.0, std=factor * ((d_model) ** -0.5)
                    )

From the LongT5 implementation: https://github.com/huggingface/transformers/blob/d0acc9537829e7d067edbb791473bbceb2ecf056/src/transformers/models/longt5/modeling_longt5.py#L1291
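
To restate what this excerpt is doing outside of the model, here is a minimal sketch with standalone tensors (all dimensions and names below are made up for the example, and the "standard" baseline is just the common convention of dividing the scores by sqrt(d_kv); none of this is transformers code). It compares that baseline against the T5/LongT5-style query initialization, which folds the 1/sqrt(d_kv) factor into the query weights so the scores can be left unscaled, as in the "avoid scaling before softmax" comment:

    import torch

    torch.manual_seed(0)

    # Toy dimensions, assumed for illustration only.
    d_model, d_kv, factor = 512, 64, 1.0
    n_tokens = 1024

    x = torch.randn(n_tokens, d_model)  # toy hidden states with roughly unit variance

    # (a) Baseline: q and k use std = d_model ** -0.5 and the attention scores
    #     are divided by sqrt(d_kv) before the softmax.
    q_std = torch.empty(d_kv, d_model).normal_(0.0, factor * d_model ** -0.5)
    k_std = torch.empty(d_kv, d_model).normal_(0.0, factor * d_model ** -0.5)
    scores_std = (x @ q_std.T) @ (x @ k_std.T).T / d_kv ** 0.5

    # (b) T5/LongT5 style: the 1/sqrt(d_kv) factor is folded into the query
    #     initialization (std = factor * (d_model * d_kv) ** -0.5) and the
    #     scores are left unscaled, as in the excerpt above.
    q_t5 = torch.empty(d_kv, d_model).normal_(0.0, factor * (d_model * d_kv) ** -0.5)
    k_t5 = torch.empty(d_kv, d_model).normal_(0.0, factor * d_model ** -0.5)
    scores_t5 = (x @ q_t5.T) @ (x @ k_t5.T).T  # no scaling before softmax

    # At initialization both score matrices have comparable magnitude.
    print(scores_std.std(), scores_t5.std())

For what it's worth, the d_model ** -0.5 used for k and v matches the inverse square root of their input dimension (d_model), and the (n_heads * key_value_proj_dim) ** -0.5 used for o matches the inverse square root of its input dimension (n_heads * d_kv), though I can't say whether that fan-in reading is the official rationale.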

  1. We notice that factor is multiplied by ((d_model * key_value_proj_dim) ** -0.5) for the query and by ((n_heads * key_value_proj_dim) ** -0.5) for the output, but only by (d_model ** -0.5) for the key and value. Why? Is there a detailed explanation of that? And is the default value of factor still 1.0?

  2. Also, today I found this issue: https://github.com/huggingface/transformers/issues/16749

According to my understanding of that issue (and correct me if I am wrong):

@patrickvonplaten corrected the initialization there, but what is still unclear to me is the relation between the tied word embedding initialization and the language model head initialization in this line https://github.com/huggingface/transformers/blob/d0acc9537829e7d067edbb791473bbceb2ecf056/src/transformers/models/t5/modeling_t5.py#L766, and why is this condition not included in the LongT5 implementation?
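
To make the second question concrete, here is a minimal, hypothetical sketch of what a tie_word_embeddings guard does in general (toy names and sizes, not the actual transformers code at the line linked above): when the word embeddings and the LM head are tied, lm_head.weight is the same tensor as the shared embedding matrix, so a separate init would only re-initialize the embeddings, while an untied head needs its own initialization.

    import torch
    from torch import nn

    # Toy, hypothetical setup illustrating a tie_word_embeddings guard;
    # dimensions and names are made up for the example.
    vocab_size, d_model, factor = 100, 16, 1.0
    tie_word_embeddings = True  # stands in for config.tie_word_embeddings

    shared = nn.Embedding(vocab_size, d_model)            # shared input embedding
    lm_head = nn.Linear(d_model, vocab_size, bias=False)  # output projection

    # The shared embedding gets its own normal init
    # (T5's _init_weights uses std = factor * 1.0 for the shared embedding).
    shared.weight.data.normal_(mean=0.0, std=factor * 1.0)

    if tie_word_embeddings:
        # Tied: the head simply points at the embedding matrix, so a separate
        # init here would only overwrite the embedding init above.
        lm_head.weight = shared.weight
    else:
        # Untied: the head is a distinct parameter and needs its own init.
        lm_head.weight.data.normal_(mean=0.0, std=factor * 1.0)

    # Prints True when tied: both names refer to the same underlying storage.
    print(lm_head.weight.data_ptr() == shared.weight.data_ptr())

This only illustrates the mechanics of such a guard; it does not answer why the corresponding condition is absent from the LongT5 implementation, which is exactly what I am asking.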

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

1 reaction
SinclairCoder commented, Aug 4, 2022

0 reactions
github-actions[bot] commented, Oct 24, 2022

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Read more comments on GitHub >

Top Results From Across the Web

T5 - Hugging Face
Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to...

Synthesizer: Rethinking Self-Attention for Transformer Models
To this end, we propose SYNTHESIZER, a model that learns synthetic attention weights without token-token interactions. In our experiments, we first show...

Synthesizer: Rethinking Self-Attention for Transformer ... - arXiv
Instead, the attention weights are initialized to random values ... Note that this follows the sequence transduction style in T5.

Neural machine translation with a Transformer and Keras | Text
Figure 2: Visualized attention weights that you can generate at the end of this tutorial. Why Transformers are significant. Transformers excel at modeling ...

Sequence-to-Sequence Translation Using Attention - MATLAB ...
Initialize Decoder Model Parameters ... Initialize the weights of the encoding embedding using the Gaussian using the initializeGaussian function. Specify a mean ...
