
Why 8 attention heads rather than 4 for BaseUnet64?

See original GitHub issue

Hi! The text-to-image UNet in the Imagen paper follows the UNet architecture defined in Improved Denoising Diffusion Probabilistic Models. In that paper, they use 4 attention heads.

In BaseUnet64, the number of attention heads is set to 8: https://github.com/lucidrains/imagen-pytorch/blob/2535012168d8839130af9c2b61ae17d6df3a7064/imagen_pytorch/imagen_pytorch.py#L1712

Is this because more attention heads are generally better?

The whole thing is a bit confusing, because the Imagen paper doesn’t specify the number of heads; rather, it specifies the number of channels per head. That suggests layers near the bottom of the UNet would have more attention heads, since they have more channels, which would be a deviation from OpenAI’s UNet architecture that the Imagen paper claims to follow. Not sure what’s going on there.
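As a quick illustration of the two conventions being compared, here is a minimal sketch (the function names are made up for this example and are not taken from imagen-pytorch or either paper):

```python
# Two ways of specifying an attention layer's head configuration.
# These helpers are illustrative only, not from imagen-pytorch or the papers.

def heads_fixed_count(heads: int = 4) -> int:
    # Improved-DDPM-style: the same number of heads at every resolution.
    return heads

def heads_from_channels_per_head(layer_channels: int, channels_per_head: int = 64) -> int:
    # Channels-per-head spec: head count scales with the layer's channel width.
    return layer_channels // channels_per_head

for channels in (256, 512, 1024):
    print(channels, heads_fixed_count(), heads_from_channels_per_head(channels))
# 256  -> 4 heads under either reading
# 512  -> 4 heads vs. 8 heads
# 1024 -> 4 heads vs. 16 heads
```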

Issue Analytics

  • State: closed
  • Created: a year ago
  • Reactions: 1
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
vedantroy commented, Jul 24, 2022

> No, it is still pretty similar.
>
> They hold the number of attention heads constant at 4, with a dimension of 64, so you would project from the model dimensions to 256.

Ahhh, that’s the thing I was missing. They are projecting from the model dimensions to 256. Got it, got it! Thanks for clarifying.

1 reaction
lucidrains commented, Jul 24, 2022

No, it is still pretty similar.

They hold the number of attention heads constant at 4, with a dimension of 64, so you would project from the model dimensions to 256.
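To make that projection concrete, here is a minimal sketch (an assumed illustration, not imagen-pytorch’s exact implementation) of attention with a fixed head count: whatever the layer’s model dimension is, q, k, and v are projected to heads * dim_head = 4 * 64 = 256 channels before attention, then projected back.

```python
# Minimal sketch (an assumed illustration, not imagen-pytorch's exact code) of
# attention with a fixed head count: q, k, v are projected from whatever the
# layer's model dimension is to heads * dim_head = 4 * 64 = 256 channels.
import torch
import torch.nn as nn


class FixedHeadAttention(nn.Module):
    def __init__(self, model_dim: int, heads: int = 4, dim_head: int = 64):
        super().__init__()
        inner_dim = heads * dim_head  # 4 * 64 = 256, independent of model_dim
        self.heads = heads
        self.scale = dim_head ** -0.5
        self.to_qkv = nn.Linear(model_dim, inner_dim * 3, bias=False)
        self.to_out = nn.Linear(inner_dim, model_dim, bias=False)

    def forward(self, x):  # x: (batch, seq, model_dim)
        b, n, _ = x.shape
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        # split the 256-channel projection into 4 heads of 64 channels each
        q, k, v = (t.reshape(b, n, self.heads, -1).transpose(1, 2) for t in (q, k, v))
        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, -1)
        return self.to_out(out)


# Whether a layer is 512 or 1024 channels wide, attention still runs on
# 4 heads of 64 channels (256 total) internally.
layer = FixedHeadAttention(model_dim=1024)
print(layer(torch.randn(2, 16, 1024)).shape)  # torch.Size([2, 16, 1024])
```

Read this way, BaseUnet64’s choice of 8 heads would simply widen that inner projection (8 × 64 = 512, assuming the same per-head dimension of 64) rather than letting each layer’s channel count dictate its head count.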

Read more comments on GitHub >

Top Results From Across the Web

  • Are Sixteen Heads Really Better than One? - ML@CMU Blog
    There are a variety of advantages tied to using attention instead of other sentence pooling operators such as recurrent neural networks, ...
  • Transformers Explained Visually (Part 3): Multi-head Attention ...
    In this article, we will go a step further and dive deeper into Multi-head Attention, which is the brains of the Transformer.
  • Are Sixteen Heads Really Better than One? - NIPS papers
    Figure 4 shows that performance drops much more rapidly when heads are pruned from the Enc-Dec attention layers. In particular, pruning more than...
  • Why multi-head self attention works: math, intuitions and 10+1 ...
    Insight 4: The encoder-decoder (cross) attention is significantly more dependent on the multi-headed decomposed representation.
  • Chapter 8 Attention and Self-Attention for NLP
    The fast modelling of long-range dependencies and the multiple attention heads which learn different dependencies makes Transformers a favourable choice for ...
