
Question about the concatenated tokens (where is the `noised image token`?)


Hi Phil, while reading the forward pass of DiffusionPriorNetwork, I noticed that the concatenated tokens fed into the CausalTransformer are composed as below: https://github.com/lucidrains/DALLE2-pytorch/blob/fd53fa17db37dcec2e89c334da3fffcd89285ff7/dalle2_pytorch/dalle2_pytorch.py#L775-L780
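For readers without the link open, the concatenation at those lines looks roughly like this (a paraphrase of the linked commit, not verbatim source; note that no image embedding appears in the sequence):

```python
# Paraphrase of the token concatenation in DiffusionPriorNetwork.forward
# at the linked commit (names follow the repository; treat this as a sketch).
tokens = torch.cat((
    text_encodings,   # encoded text
    text_embed,       # CLIP text embedding
    time_embed,       # diffusion timestep embedding
    learned_queries   # learned query token(s)
), dim = -2)
```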

But referring to Section 2.2 of the original paper, the sequence is described as "...consisting of encoded text, the CLIP text embedding, an embedding for the diffusion timestep, the noised CLIP image embedding, and a final embedding whose output from the Transformer is used to predict the unnoised CLIP image embedding." I just wonder which part corresponds to the noised CLIP image embedding (maybe learned_queries?). It confuses me.
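For contrast, the sequence the paper describes would include one more token. In the sketch below, noised_image_embed is a hypothetical name added purely for illustration:

```python
# Token sequence as described in Section 2.2 of the paper: the noised CLIP
# image embedding sits between the timestep embedding and the final query.
tokens = torch.cat((
    text_encodings,       # encoded text
    text_embed,           # CLIP text embedding
    time_embed,           # diffusion timestep embedding
    noised_image_embed,   # noised CLIP image embedding (hypothetical name)
    learned_queries       # final embedding whose output predicts the unnoised one
), dim = -2)
```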


Issue Analytics

  • State: closed
  • Created: a year ago
  • Reactions: 2
  • Comments: 10 (4 by maintainers)

Top GitHub Comments

3 reactions
CiaoHe commented, May 7, 2022

> I don’t understand you. The purpose of the Prior is to predict a range of CLIP image embeddings given the inputs:
>
>   • CLIP text embeddings, and
>   • optionally, text
>
> In that case, how do we pass CLIP image embeddings to the Prior network to noise them? (your point #1)
>
> Also, the purpose of the entire Prior -> Decoder pipeline is so that we can input to the Prior:
>
>   • text
>   • the CLIP text embedding obtained from that text
>
> The Prior should then predict the CLIP image embedding, to be decoded by the Decoder to generate an image.
>
> We have access to the image, and hence its CLIP image embedding, during training. But how do we obtain it at test time, when no image is available? Should we use the CLIP image embedding (or some noised version of it) as an input to the Prior?

Let me clarify it step by step. When sampling, you use p_sample_loop(), right? p_sample_loop() just calls p_sample() to run the reverse process (generating a clean embedding from noise), so the initial img_embed (as written at line 881) is randomly initialized: https://github.com/lucidrains/DALLE2-pytorch/blob/fd53fa17db37dcec2e89c334da3fffcd89285ff7/dalle2_pytorch/dalle2_pytorch.py#L877-L885

During sampling, p_sample() calls p_mean_variance() to get the μ and σ used for sampling: https://github.com/lucidrains/DALLE2-pytorch/blob/fd53fa17db37dcec2e89c334da3fffcd89285ff7/dalle2_pytorch/dalle2_pytorch.py#L870. Here, x is just image_emb; all text-related information is contained in text_cond (a dict). Then, inside p_mean_variance(), the PriorNet is run forward: https://github.com/lucidrains/DALLE2-pytorch/blob/fd53fa17db37dcec2e89c334da3fffcd89285ff7/dalle2_pytorch/dalle2_pytorch.py#L849. What is x here? I think it should still be image_emb. Next, jumping into the PriorNet's forward, we can see it takes image_embed as a parameter: https://github.com/lucidrains/DALLE2-pytorch/blob/fd53fa17db37dcec2e89c334da3fffcd89285ff7/dalle2_pytorch/dalle2_pytorch.py#L727-L729
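Putting those pieces together, the sampling procedure is standard DDPM ancestral sampling over the embedding. Here is a minimal sketch, assuming a hypothetical p_mean_variance callable that wraps the PriorNet forward pass (this is not the repository's actual code):

```python
import torch

def sample_image_embed(p_mean_variance, text_cond, shape, num_timesteps=1000):
    # At inference time there is no image, so the embedding starts as pure noise.
    image_embed = torch.randn(shape)

    # Reverse process: iteratively denoise, conditioning only on text_cond.
    for t in reversed(range(num_timesteps)):
        # p_mean_variance runs the PriorNet forward and returns mu and sigma.
        mean, variance = p_mean_variance(image_embed, t, text_cond)
        # Add noise on every step except the final one (t == 0).
        if t > 0:
            noise = torch.randn_like(image_embed)
        else:
            noise = torch.zeros_like(image_embed)
        image_embed = mean + variance.sqrt() * noise

    return image_embed  # refined embedding, ready to hand to the Decoder
```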


So, my point is: at inference time, image_emb is just randomly initialized, since we don't have any image. During the PriorNet's generation process, image_emb is refined (or, say, generated) using the text (or text_emb) information. Once you have the generated image_emb, the rest of the work is simply passed to the Decoder.
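That division of labor matches the repository's top-level usage, which (per the README at the time, lightly paraphrased, so treat exact signatures as approximate) looks like this:

```python
from dalle2_pytorch import DALLE2

# diffusion_prior and decoder are assumed to be already-trained instances
# (construction omitted here); DALLE2 simply chains them together.
dalle2 = DALLE2(prior = diffusion_prior, decoder = decoder)

# Internally: the prior samples an image embedding from the text, then the
# decoder generates an image conditioned on that embedding.
images = dalle2(['cute puppy chasing after a squirrel'])
```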

But anyway, in the current version of the PriorNet forward pass, I cannot see image_emb joining the combined tokens that are fed into the CausalTransformer. This is what I am concerned about.

1 reaction
CiaoHe commented, May 7, 2022

@lucidrains Haha, thanks for your attention. I learned a lot from your code and really want to make a little contribution. And thanks for the invitation; if I get the chance, I will thank you in person.

