Question about the concatenated tokens (where is the `noised image token`?)
Hi Phil,

When reading the `forward` part of `DiffusionPriorNetwork`, I noticed that the concatenated tokens fed into the `CausalTransformer` are composed like below:
https://github.com/lucidrains/DALLE2-pytorch/blob/fd53fa17db37dcec2e89c334da3fffcd89285ff7/dalle2_pytorch/dalle2_pytorch.py#L775-L780
But referring to Section 2.2 of the original paper, it reads:

> ...consisting of encoded text, the CLIP text embedding, an embedding for the diffusion timestep, the noised CLIP image embedding, and a final embedding whose output from the Transformer is used to predict the unnoised CLIP image embedding.
I just wonder which part corresponds to the noised CLIP image embedding (maybe `learned_queries`?). It just confuses me.
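For reference, this is how I picture the sequence described in Section 2.2 — just a sketch with made-up shapes, not the repo's actual code:

```python
import torch

# Made-up shapes, only to make the order in Section 2.2 concrete:
# batch b, text length n, model dimension d.
b, n, d = 2, 77, 512

text_encodings     = torch.randn(b, n, d)  # encoded text (per-token text encodings)
text_embed         = torch.randn(b, 1, d)  # pooled CLIP text embedding
time_embed         = torch.randn(b, 1, d)  # embedding of the diffusion timestep t
noised_image_embed = torch.randn(b, 1, d)  # x_t: the noised CLIP image embedding
learned_query      = torch.randn(b, 1, d)  # final embedding; its transformer output predicts x_0

tokens = torch.cat((text_encodings, text_embed, time_embed, noised_image_embed, learned_query), dim=-2)
# tokens: (b, n + 4, d), fed through the causal transformer; the output at the
# learned-query position is read off as the predicted (un-noised) image embedding.
```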
Enjoy!
Top GitHub Comments
Let me clarify it step by step. When sampling, you use `p_sample_loop()`, right? `p_sample_loop()` just calls `p_sample()` to finish the backward process (generating from noise to a clean sample), so the initial `img_embed` (as written in line 881) is randomly initialized:
https://github.com/lucidrains/DALLE2-pytorch/blob/fd53fa17db37dcec2e89c334da3fffcd89285ff7/dalle2_pytorch/dalle2_pytorch.py#L877-L885
So, when doing sampling, `p_sample()` will call `p_sample_variance()` to get \mu and \sigma for sampling:
https://github.com/lucidrains/DALLE2-pytorch/blob/fd53fa17db37dcec2e89c334da3fffcd89285ff7/dalle2_pytorch/dalle2_pytorch.py#L870
so the `x` there is just `image_emb`; the text-related information is all included in `text_cond` (a dict).
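Roughly, the backward process looks like this (a generic DDPM-style sketch where the network predicts the added noise; `prior_net` and the schedule here are placeholders, not the exact code at that commit):

```python
import torch

@torch.no_grad()
def sample_image_embed(prior_net, text_cond, timesteps, batch, dim):
    # Generic DDPM-style reverse process: start from pure noise and iteratively
    # denoise, conditioned only on the text information in `text_cond`.
    betas = torch.linspace(1e-4, 0.02, timesteps)           # a common linear noise schedule
    alphas = 1. - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)

    image_embed = torch.randn(batch, dim)                    # random init: no real image at inference time
    for t in reversed(range(timesteps)):
        pred_noise = prior_net(image_embed, t, **text_cond)  # placeholder call: predict noise from x_t and text
        # posterior mean of x_{t-1} given x_t and the predicted noise (Ho et al., Algorithm 2)
        mean = (image_embed - betas[t] / (1. - alphas_cumprod[t]).sqrt() * pred_noise) / alphas[t].sqrt()
        noise = torch.randn_like(image_embed) if t > 0 else torch.zeros_like(image_embed)
        image_embed = mean + betas[t].sqrt() * noise         # x_{t-1} ~ N(mean, beta_t)
    return image_embed
```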
Then, in `p_sample_variance()`, it forwards the PriorNet:
https://github.com/lucidrains/DALLE2-pytorch/blob/fd53fa17db37dcec2e89c334da3fffcd89285ff7/dalle2_pytorch/dalle2_pytorch.py#L849
What is `x` here? I think it should still be `image_emb`. Next, jumping into the PriorNet's forward, we can see it parses `image_embed` in:
https://github.com/lucidrains/DALLE2-pytorch/blob/fd53fa17db37dcec2e89c334da3fffcd89285ff7/dalle2_pytorch/dalle2_pytorch.py#L727-L729

So, my point is: at inference time, `image_emb` is just randomly initialized, since we don't have any image. During the PriorNet's generating process, `image_emb` is refined (or, say, generated) using the text (or `text_emb`) information. Once you get the generated `image_emb`, the rest of the work just passes to the Decoder part.

But anyway, in the current version of the PriorNet forward process, I cannot see `image_emb` join the combined tokens that are fed into the Causal Transformer. This is what I am concerned about.

@lucidrains Haha, thanks for your attention. I have learned a lot from your code and really want to make a little contribution. And thanks for your invitation; if I get the chance, I will thank you in person.
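Just to summarize the flow I have in mind (hypothetical names, only to make the hand-off to the Decoder explicit):

```python
# Hypothetical names, only to summarize the flow discussed above:
text_embed, text_encodings = clip.embed_text(text)  # CLIP text embedding + per-token encodings
image_embed = diffusion_prior.sample(text_embed)    # prior: starts from noise, refined with text info
image = decoder.sample(image_embed)                  # decoder: the rest of the work happens here
```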