CogView thinks image and text should be weighted the same
In the CogView paper they claim that by giving the text as much importance as the image they achieve a better result. They "hypothesize" that this is because the transformer is learning both how to predict images from text and how logic/knowledge/information works in general. As far as I can tell, it isn't mentioned again, unfortunately.
At any rate, perhaps we should run some tests with the --img_loss_weight parameter set to 1?
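For context, a minimal sketch of how the weighted loss is presumably combined (the exact dalle-pytorch code and parameter name may differ; the name here just mirrors the --img_loss_weight flag above):

```python
import torch.nn.functional as F

def combined_loss(text_logits, text_labels, img_logits, img_labels,
                  img_loss_weight=7.0):
    # Weighted average of the per-modality cross-entropy terms.
    # img_loss_weight=7 is the DALL-E-style default; setting it to 1
    # weights text and image tokens equally, as CogView argues for.
    loss_text = F.cross_entropy(text_logits, text_labels)
    loss_img = F.cross_entropy(img_logits, img_labels)
    return (loss_text + img_loss_weight * loss_img) / (img_loss_weight + 1)
```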
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
@janEbert
I think they may have been trying to underfit the data for some reason, because of issues which are perhaps more apparent at the scale OpenAI was operating at. There's a good deal of hand-waving/we-did-it-because-they-did-it in the DALL-E paper though, so I need to revisit their reasoning for weighting the text-predicts-image loss more.
The intuition provided by CogView is equally hand-wavy in my opinion, and I wouldn't be surprised if either weighting hit the same loss curve over roughly the same amount of time due to scaling laws.
If I may commit a bit of "academic fraud": my new intuition, based purely on running a whole bunch of generations with the image and text losses weighted the same, is that there is a relationship between the weighting and the noise of your dataset. If you have very lengthy captions that are all accurate and concise, use a lower text-predicts-image weight and a higher text_seq_len. The text portion of the transformer will indeed benefit from "being able to learn more about the language modality", although probably mostly due to the increase in text_seq_len.
A good example is Open Images Localized Annotations (~500k image-text pairs). I trained on that dataset with a text_seq_len in excess of 600. With the weighting at seven, the loss obviously doesn't go down nearly as quickly, but you can get some pretty great-looking outliers a lot earlier on. When I weighted them the same (please remember, this is all intuition/hypothesis), the loss gets to within an order of magnitude of where it more or less settles much faster. The images look better on average perceptually (although those same outlier examples do tend to look worse: worse according to errors you can see given the caption, i.e. they look worse perceptually, not conceptually). Does that make sense?
I haven't experimented with a "noisy" dataset, but I bet a good example would be something like the Wikipedia articles from WIT. Open Images Localized Annotations is literally transcribed spoken word of a human being describing an image as they look at it. Sure, it has errors and such, but most of the words in the dataset are directly about what is in the image itself, not a discussion of the history of the subject in the image or what have you. As such, you may want to underfit the transformer's text-predicts-text loss by increasing the text-predicts-image weight on such a dataset. Perhaps this allows the model to underfit the language modality and be "more open to interpretation", so to speak.
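Purely as an illustration of that intuition, and assuming the text_seq_len / image-loss-weight parameters discussed above (the numbers are guesses, not tuned values):

```python
# Hypothetical settings following the clean-vs-noisy-caption intuition above.
clean_verbose_captions = dict(   # e.g. Open Images Localized Annotations
    text_seq_len=640,            # long, accurate descriptions of the image
    img_loss_weight=1,           # let the model really learn the language modality
)

noisy_loose_captions = dict(     # e.g. Wikipedia text from WIT
    text_seq_len=256,            # much of the text isn't about the image itself
    img_loss_weight=7,           # deliberately underfit text-predicts-text
)
```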
Again - who knows; this stuff would all be fascinating to visualize if we had a heatmap of the attention heads. Unfortunately I have no idea how to implement that.
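For anyone who wants to try, here is a minimal sketch of plotting a single head as a heatmap, assuming you can get at the attention probabilities at all; dalle-pytorch doesn't expose them directly, so a forward hook or a small fork of the attention module would be needed, and the tensors below are random stand-ins:

```python
import torch
import matplotlib.pyplot as plt

seq_len, dim_head = 64, 32
q = torch.randn(seq_len, dim_head)  # stand-in queries for one head
k = torch.randn(seq_len, dim_head)  # stand-in keys for one head

# Attention probabilities for one head: softmax(QK^T / sqrt(d)).
attn = (q @ k.t() / dim_head ** 0.5).softmax(dim=-1)

plt.imshow(attn.numpy(), cmap='viridis', aspect='auto')
plt.xlabel('key position (text tokens first, then image tokens)')
plt.ylabel('query position')
plt.colorbar(label='attention weight')
plt.title('single attention head')
plt.show()
```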
@janEbert is the relationship with the text sequence length what's at play here? I've got some preliminary test runs suggesting this is the case.
Increasing the text_seq_len causes the loss to converge at a much higher value.
For instance, using a text_seq_len of 384: if you decrease the image loss weight to 4, training converges to a lower value of ~5.5.
By decreasing the img loss weight to 1, I finally (very quickly) converge to the loss I'm used to on a "diverse enough" dataset (for lack of more rigorous words).
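One caveat when comparing these converged values, assuming the weighted-average loss sketched earlier: changing the weight also changes the normalization of the reported number, so the absolute losses aren't directly comparable across weights. A toy example with made-up per-modality losses:

```python
text_loss, img_loss = 2.0, 6.0  # hypothetical per-modality cross-entropies

for w in (7, 4, 1):
    combined = (text_loss + w * img_loss) / (w + 1)
    print(f'img_loss_weight={w}: combined loss = {combined:.2f}')
# img_loss_weight=7: combined loss = 5.50
# img_loss_weight=4: combined loss = 5.20
# img_loss_weight=1: combined loss = 4.00
```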
CogView used 1024 tokens. Re-reading their wording, they don't necessarily criticize the weighting but more so criticize OpenAI for assuming the impact of text on the loss was merely "auxiliary". So perhaps the difference in weighting here is due to training on Chinese text? I'm regrettably uninformed about the differences between English and Chinese and how they relate to training language models.
Is it possible that OpenAI's dataset simply didn't benefit much from a lengthy text sequence because their average caption didn't tend to be that long in the first place? If so, perhaps it would be wiser to scale other parts of the model instead?
I’ll follow up with reconstructions in a bit.