Dimensionality of key and values for Attention
I have two questions about the key and value calculation in Attention (and similarly for KNNAttention).
The relevant line is: https://github.com/lucidrains/memorizing-transformers-pytorch/blob/83fa1479d6f7881dd977fbff55681e709e3b250e/memorizing_transformers_pytorch/memorizing_transformers_pytorch.py#L135
- Why is there only one Linear layer, `to_kv`, instead of two Linear layers, `to_k` and `to_v`?
- Why is the last dimension `dim_head*2`? I get that the *2 is for both k and v, but what about dim_head? I thought q, k, v should all have the same final dimension (i.e. `inner_dim == dim_head*heads`). My understanding is that this means either a) there is only 1 attention head, or b) k and v are shared across all heads. Is there a reason this is done, or am I misunderstanding?
In your Attention class for Performer, q, k, v all have the same dimensions.
Thanks in advance!
Issue Analytics
- Created a year ago
- Comments: 8 (2 by maintainers)
Top GitHub Comments
Thanks! What about this question: Why is there only one Linear layer to_kv, instead of 2 linear layers to_k and to_v?
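On the one-layer-versus-two question, a minimal sketch may help. It assumes the projection has no bias and uses made-up sizes (`dim = 64`, `dim_head = 16`), and the names `to_k` and `to_v` are hypothetical layers created only for the comparison. It is not the repository's code, only a check that a fused `to_kv` followed by `chunk` computes the same thing as two separate Linear layers holding the matching halves of the weight matrix:

```python
import torch
from torch import nn

torch.manual_seed(0)

dim, dim_head = 64, 16  # illustrative sizes, not the repository defaults

# One fused projection that emits keys and values side by side.
to_kv = nn.Linear(dim, dim_head * 2, bias=False)

# Two separate (hypothetical) projections, initialised from the two halves
# of the fused weight. nn.Linear stores weight as (out_features, in_features),
# so the first dim_head rows produce the first dim_head output channels.
to_k = nn.Linear(dim, dim_head, bias=False)
to_v = nn.Linear(dim, dim_head, bias=False)
with torch.no_grad():
    to_k.weight.copy_(to_kv.weight[:dim_head])
    to_v.weight.copy_(to_kv.weight[dim_head:])

x = torch.randn(2, 10, dim)  # (batch, sequence, dim)

# Fused path: one matmul, then split the last dimension in two.
k_fused, v_fused = to_kv(x).chunk(2, dim=-1)

# Separate path: two matmuls.
k_sep, v_sep = to_k(x), to_v(x)

same_k = torch.allclose(k_fused, k_sep, atol=1e-5)
same_v = torch.allclose(v_fused, v_sep, atol=1e-5)
```

If that holds, the single layer is just a packaging choice (one matrix multiply instead of two) and does not by itself change what the model can represent.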
I guess this commit cites the paper that does 1 headed attention: https://github.com/lucidrains/memorizing-transformers-pytorch/commit/9f77fd5e4e449d70c02b9cd25a98e1d5ef5f0a72
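For the `dim_head*2` part, here is a rough sketch of what "one head for k and v, shared by all query heads" could look like. The sizes, the variable names, and the missing causal mask are all assumptions made for illustration; this is not copied from the repository:

```python
import torch
from torch import nn

torch.manual_seed(0)

batch, seq, dim = 2, 10, 64   # illustrative sizes
heads, dim_head = 4, 16
inner_dim = heads * dim_head

to_q = nn.Linear(dim, inner_dim, bias=False)      # one query per head
to_kv = nn.Linear(dim, dim_head * 2, bias=False)  # a single key/value head
to_out = nn.Linear(inner_dim, dim, bias=False)

x = torch.randn(batch, seq, dim)

# Queries are split into heads: (batch, heads, seq, dim_head).
q = to_q(x).view(batch, seq, heads, dim_head).transpose(1, 2)

# Keys and values have no head axis: each is (batch, seq, dim_head).
k, v = to_kv(x).chunk(2, dim=-1)

# Every query head attends over the same keys (no mask in this sketch).
scores = torch.einsum('bhid,bjd->bhij', q, k) * dim_head ** -0.5
attn = scores.softmax(dim=-1)

# Every head reads from the same values.
out = torch.einsum('bhij,bjd->bhid', attn, v)

# Merge heads back and project to the model dimension.
out = out.transpose(1, 2).reshape(batch, seq, inner_dim)
y = to_out(out)
```

In this layout q still has `inner_dim == dim_head*heads` channels, while k and v only need `dim_head` channels each, which would correspond to option b) in the question.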