Dimensionality of key and values for Attention

See original GitHub issue

I have two questions about the key and value calculation in Attention (and similarly for KNNAttention).

The relevant line is: https://github.com/lucidrains/memorizing-transformers-pytorch/blob/83fa1479d6f7881dd977fbff55681e709e3b250e/memorizing_transformers_pytorch/memorizing_transformers_pytorch.py#L135

1. Why is there only one Linear layer, to_kv, instead of two separate Linear layers, to_k and to_v?
2. Why is the last dimension dim_head*2? I understand that the *2 covers both k and v, but why dim_head? I thought q, k, and v should all have the same final dimension (i.e. inner_dim == dim_head*heads). As I understand it, this means that either a) there is only one attention head, or b) k and v are shared across all heads. Is there a reason for this, or am I misunderstanding?

In your Attention class for Performer, q, k, v all have the same dimensions.

Thanks in advance!

Issue Analytics

State:
Created a year ago
Comments: 8 (2 by maintainers)

Top GitHub Comments

1 reaction

manestay commented, Jun 23, 2022

Thanks! What about this question: Why is there only one Linear layer to_kv, instead of 2 linear layers to_k and to_v?
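One way to look at the to_kv question is that a single fused projection of width dim_head*2 computes the same thing as two separate projections of width dim_head. The following is a minimal sketch, not the repository's code: the sizes are arbitrary, and to_k and to_v are hypothetical layers created here only for comparison.

```python
import torch
from torch import nn

torch.manual_seed(0)
dim, dim_head = 16, 4  # illustrative sizes, not taken from the repository

# One fused projection that produces keys and values side by side.
to_kv = nn.Linear(dim, dim_head * 2, bias=False)

# Hypothetical separate projections, initialised from the two halves
# of the fused weight matrix (shape: out_features x in_features).
to_k = nn.Linear(dim, dim_head, bias=False)
to_v = nn.Linear(dim, dim_head, bias=False)
with torch.no_grad():
    to_k.weight.copy_(to_kv.weight[:dim_head])
    to_v.weight.copy_(to_kv.weight[dim_head:])

x = torch.randn(2, 5, dim)  # (batch, sequence, dim)

# Split the fused output along the last dimension into k and v.
k_fused, v_fused = to_kv(x).chunk(2, dim=-1)

fused_matches_separate = (
    torch.allclose(k_fused, to_k(x), atol=1e-5)
    and torch.allclose(v_fused, to_v(x), atol=1e-5)
)
```

Under these assumptions the fused layer is a convenience: it runs one matrix multiplication instead of two, and the learned parameters are simply the two weight matrices stacked together.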

1 reaction

manestay commented, Jun 22, 2022

I guess this commit cites the paper that uses single-headed attention: https://github.com/lucidrains/memorizing-transformers-pytorch/commit/9f77fd5e4e449d70c02b9cd25a98e1d5ef5f0a72
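To make the shapes concrete, here is a minimal sketch of attention in which the queries have several heads while the keys and values have a single head that every query head shares (an arrangement often called multi-query attention). It assumes the layout described in the question (to_q of width heads*dim_head, to_kv of width dim_head*2); the sizes are made up, and causal masking and the output projection are omitted, so this is an illustration and not the repository's implementation.

```python
import torch
from torch import nn

torch.manual_seed(0)
batch, seq, dim, heads, dim_head = 2, 5, 32, 4, 8  # illustrative sizes

to_q = nn.Linear(dim, heads * dim_head, bias=False)  # one slice per head
to_kv = nn.Linear(dim, dim_head * 2, bias=False)     # a single k and v head

x = torch.randn(batch, seq, dim)

# Queries get an explicit head axis: (batch, heads, seq, dim_head).
q = to_q(x).view(batch, seq, heads, dim_head).transpose(1, 2)

# Keys and values have no head axis: each is (batch, seq, dim_head).
k, v = to_kv(x).chunk(2, dim=-1)

# Every query head attends over the same keys; the einsum broadcasts k over h.
scores = torch.einsum('bhid,bjd->bhij', q, k) * dim_head ** -0.5
attn = scores.softmax(dim=-1)

# The same values are reused by every head as well.
out = torch.einsum('bhij,bjd->bhid', attn, v)

# Merge heads back: (batch, seq, heads * dim_head).
out = out.transpose(1, 2).reshape(batch, seq, heads * dim_head)
```

In this sketch the output still has width heads*dim_head, so the rest of the block is unchanged; only the key and value projections (and any cached keys and values) shrink by a factor of heads.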
