Why 8 attention heads rather than 4 for BaseUnet64?
Hi! The text => image UNet in the Imagen paper follows the UNet architecture defined in Improved Denoising Diffusion Probabilistic Models. In that paper, they use 4 attention heads:
In the BaseUnet64, the number of attention heads is set to 8:
https://github.com/lucidrains/imagen-pytorch/blob/2535012168d8839130af9c2b61ae17d6df3a7064/imagen_pytorch/imagen_pytorch.py#L1712
Is this because more attention heads are generally better?
The whole thing is a bit confusing, because the Imagen paper doesn't specify the number of heads; rather, it specifies the number of channels per head. That suggests layers near the bottom of the UNet would have more attention heads, since they have more channels, which would be a deviation from OpenAI's UNet architecture that the Imagen paper claims to follow. Not sure what's going on there.
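To make the two conventions concrete, here is a small illustrative sketch (not code from imagen-pytorch or the OpenAI repo; the function names and the `dim_head=64` default are assumptions for illustration): fixing the channels per head makes the head count grow with the layer width, while fixing the head count keeps it constant at every resolution.

```python
# Illustrative only: contrasting the two head-count conventions discussed above.

def heads_from_channels_per_head(layer_dim: int, dim_head: int = 64) -> int:
    # "Channels per head" reading: wider (deeper) layers get more heads,
    # e.g. 512 channels -> 8 heads, 1024 channels -> 16 heads.
    return layer_dim // dim_head

def heads_fixed(layer_dim: int, heads: int = 8) -> int:
    # Fixed-head-count reading: the head count is a hyperparameter,
    # independent of the layer width.
    return heads

for dim in (256, 512, 1024):
    print(dim, heads_from_channels_per_head(dim), heads_fixed(dim))
```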
Top GitHub Comments
Ahhh, that’s the thing I was missing. They are projecting from the model dimensions to 256. Got it got it! Thanks for clarifying.
No, it is still pretty similar: they hold the number of attention heads constant at 4, with a head dimension of 64, so you would project from the model dimensions to 256.
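A minimal sketch of what that projection looks like (assuming a generic PyTorch-style attention layer, not the exact module in imagen-pytorch): queries, keys, and values are projected from the model dimension down to heads * dim_head = 4 * 64 = 256, so the attention width stays the same no matter how wide the layer is.

```python
import torch
from torch import nn

class Attention(nn.Module):
    # Minimal sketch: inner attention width = heads * dim_head = 4 * 64 = 256,
    # regardless of the model (layer) dimension.
    def __init__(self, dim, heads=4, dim_head=64):
        super().__init__()
        self.heads = heads
        inner_dim = heads * dim_head                  # 256 here
        self.scale = dim_head ** -0.5
        self.to_qkv = nn.Linear(dim, inner_dim * 3, bias=False)
        self.to_out = nn.Linear(inner_dim, dim, bias=False)

    def forward(self, x):                             # x: (batch, seq, dim)
        b, n, _ = x.shape
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        # split heads: (batch, heads, seq, dim_head)
        q, k, v = (t.view(b, n, self.heads, -1).transpose(1, 2) for t in (q, k, v))
        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, -1)
        return self.to_out(out)                       # project back to the model dim

x = torch.randn(2, 16, 512)                           # a 512-channel layer
print(Attention(dim=512)(x).shape)                    # attention still runs at width 256
```

With heads fixed at 4 and dim_head at 64, a 512-channel layer and a 1024-channel layer both attend at width 256; only the input and output projections change.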