Regarding learned image embedding and text embedding in Unet
According to the paper, Section 2.1 (Decoder), it says:
We enable classifier-free guidance by randomly setting CLIP embeddings to zero (or a learned embedding) 10% of the time, and randomly dropping the text caption 50% of the time during training.
It seems that we are replacing the embeddings after turning them into conditioning sequences.
https://github.com/lucidrains/DALLE2-pytorch/blob/main/dalle2_pytorch/dalle2_pytorch.py#L1216-L1222 https://github.com/lucidrains/DALLE2-pytorch/blob/main/dalle2_pytorch/dalle2_pytorch.py#L1229-L1234
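For context, here is a minimal sketch of what that replacement step looks like. This is not the repo's exact code; `ImageEmbedDropper`, `null_image_embed`, and `cond_drop_prob` are illustrative names.

```python
import torch
from torch import nn

class ImageEmbedDropper(nn.Module):
    """Sketch of classifier-free guidance dropout on the image embedding:
    with probability cond_drop_prob, the real embedding is swapped for a
    learned null embedding (names are illustrative, not the repo's)."""

    def __init__(self, dim, cond_drop_prob = 0.1):
        super().__init__()
        self.cond_drop_prob = cond_drop_prob
        # a single learned null embedding, broadcast over the batch
        self.null_image_embed = nn.Parameter(torch.randn(1, dim))

    def forward(self, image_embed):
        batch = image_embed.shape[0]
        # per-sample Bernoulli mask: True means keep the real conditioning
        keep_mask = torch.rand(batch, 1, device = image_embed.device) >= self.cond_drop_prob
        return torch.where(keep_mask, image_embed, self.null_image_embed)
```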
And from the following it seems that the null text embeddings can vary according to their sequence position. For image embeddings I feel that is fine, but what about for text encodings?
https://github.com/lucidrains/DALLE2-pytorch/blob/main/dalle2_pytorch/dalle2_pytorch.py#L1104
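To make the question concrete, here is a small sketch (hypothetical shapes and names, not the repo's code) contrasting a per-position learned null text encoding with a single null vector broadcast across all positions:

```python
import torch
from torch import nn

batch, max_text_len, dim = 4, 256, 512  # hypothetical sizes

# (a) per-position null: every sequence position has its own learned vector
null_text_encodings = nn.Parameter(torch.randn(1, max_text_len, dim))

# (b) position-agnostic null: one vector shared by all positions
null_text_encoding_shared = nn.Parameter(torch.randn(1, 1, dim))

text_encodings = torch.randn(batch, max_text_len, dim)
keep_mask = torch.rand(batch, 1, 1) >= 0.5  # drop whole captions per sample

dropped_a = torch.where(keep_mask, text_encodings, null_text_encodings)
dropped_b = torch.where(keep_mask, text_encodings, null_text_encoding_shared)
```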
Also, it seems we may need separate cond_drop_prob values, one for the image embedding and one for the text encodings. If we do that, how do we modify forward_with_cond_scale()?
https://github.com/lucidrains/DALLE2-pytorch/blob/main/dalle2_pytorch/dalle2_pytorch.py#L1166-L1178
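One possible way to do that (purely a sketch under my own assumptions, not the repo's actual implementation; all names here are hypothetical) is to give the forward pass two override probabilities and have forward_with_cond_scale() force both to 0 for the conditional pass and to 1 for the unconditional pass:

```python
import torch
from torch import nn

class UnetSketch(nn.Module):
    """Hypothetical sketch: separate drop probabilities for the image embedding
    and the text encodings, plus a forward_with_cond_scale that forces both to
    the unconditional branch. Not the repo's actual code."""

    def __init__(self, dim, image_cond_drop_prob = 0.1, text_cond_drop_prob = 0.5):
        super().__init__()
        self.image_cond_drop_prob = image_cond_drop_prob
        self.text_cond_drop_prob = text_cond_drop_prob
        self.null_image_embed = nn.Parameter(torch.randn(1, dim))
        self.null_text_embed = nn.Parameter(torch.randn(1, 1, dim))
        # ... the rest of the unet would go here ...

    def forward(self, x, image_embed, text_encodings,
                image_cond_drop_prob = None, text_cond_drop_prob = None):
        image_p = image_cond_drop_prob if image_cond_drop_prob is not None else self.image_cond_drop_prob
        text_p = text_cond_drop_prob if text_cond_drop_prob is not None else self.text_cond_drop_prob

        batch = x.shape[0]
        keep_image = torch.rand(batch, 1, device = x.device) >= image_p
        keep_text = torch.rand(batch, 1, 1, device = x.device) >= text_p

        # independently swap each condition for its learned null embedding
        image_embed = torch.where(keep_image, image_embed, self.null_image_embed)
        text_encodings = torch.where(keep_text, text_encodings, self.null_text_embed)
        # ... run the denoising network on x with these conditions ...
        return x

    def forward_with_cond_scale(self, *args, cond_scale = 1., **kwargs):
        # conditional pass: never drop conditioning at sampling time
        logits = self.forward(*args, image_cond_drop_prob = 0., text_cond_drop_prob = 0., **kwargs)
        if cond_scale == 1:
            return logits
        # unconditional pass: drop both conditions with probability 1
        null_logits = self.forward(*args, image_cond_drop_prob = 1., text_cond_drop_prob = 1., **kwargs)
        return null_logits + (logits - null_logits) * cond_scale
```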
Top GitHub Comments
How about the time dependence of null_text_embed?
https://github.com/lucidrains/DALLE2-pytorch/blob/main/dalle2_pytorch/dalle2_pytorch.py#L1104
@xiankgx haha, actually there was another issue with the null padding tokens, only uncovered because of your issues. https://github.com/lucidrains/DALLE2-pytorch/commit/1c1e508369da34eb35741558d33203f42fea006e should be ok now
keep it coming! 🙏