KL divergence term in DiscreteVAE
Hi Phil @lucidrains, I noticed a KL divergence term (with its weight defaulting to 0) in the DiscreteVAE. The often-quoted paper (Neural Discrete Representation Learning) instead has two extra stop-gradient terms. Can you point me to a reference for the definition used in the code?
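For context, here is a minimal sketch of the two formulations being contrasted. This is illustrative only: the variable names (`logits`, `num_tokens`, `kl_weight`, `z_e`, `e`, `beta`) are assumptions rather than the repository's actual identifiers. The DiscreteVAE-style term relaxes the discrete codes with a softmax and penalizes divergence from a uniform prior over the codebook, while VQ-VAE's objective instead uses two stop-gradient (detach) terms.

```python
import math
import torch
import torch.nn.functional as F

# --- KL-style term (Gumbel-softmax / discrete VAE flavor) ---
# Illustrative shapes: encoder logits over a codebook at each spatial position.
batch, num_tokens, h, w = 4, 512, 32, 32
logits = torch.randn(batch, num_tokens, h, w)

log_qy = F.log_softmax(logits, dim=1)                         # log q(y|x)
log_uniform = torch.full_like(log_qy, -math.log(num_tokens))  # log of uniform prior

kl_weight = 0.0  # the weight defaults to 0, so the term is effectively off
# F.kl_div(input, target, log_target=True) computes KL(target || input),
# here KL(q(y|x) || Uniform(num_tokens)).
kl_loss = kl_weight * F.kl_div(log_uniform, log_qy,
                               reduction='batchmean', log_target=True)

# --- VQ-VAE's two stop-gradient terms, for contrast ---
z_e = torch.randn(batch, 64)   # encoder output (illustrative)
e = torch.randn(batch, 64)     # nearest codebook vectors (illustrative)
beta = 0.25
codebook_loss = F.mse_loss(e, z_e.detach())           # ||sg[z_e(x)] - e||^2
commitment_loss = beta * F.mse_loss(z_e, e.detach())  # beta * ||z_e(x) - sg[e]||^2
```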
Top Results From Across the Web
- How should I intuitively understand the KL divergence loss in ...: This is where we use KL divergence as a measure of a difference between two probability distributions. The VAE objective function thus ...
- Learning Disentangled Joint Continuous and Discrete ... - arXiv: In most VAE models with a Gumbel-Softmax latent variable, the KL divergence between the Gumbel-Softmax variables is approximated by the KL ...
- A Quick Primer on KL Divergence - Adam Lineberry: KL divergence is roughly a measure of distance between two probability distributions. There are different forms of the KL divergence equation.
- Disentangling VAEs, KL Divergence, and Mutual Information: I've recently been reading about the JointVAE model proposed by Emilien Dupont in the paper Learning Disentangled Joint Continuous and Discrete ...
- Kullback-Leibler Divergence for Machine Learning - Medium: In layman's terms, the K-L divergence is a measure of how different a specific probability distribution is from a reference distribution.
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
According to the authors, they needed something like 5+ epochs on ImageNet to get good results, which would translate to >600k iterations at batch size 8. I would highly suggest fine-tuning their pretrained models if possible to save GPU time.
https://github.com/CompVis/taming-transformers/issues/31#issuecomment-809382665
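As a rough sanity check on those numbers (assuming ImageNet-1k's ~1.28M training images, which is my assumption rather than something stated in the comment):

```python
# Back-of-the-envelope check of the ">600k iterations" figure.
images_per_epoch = 1_281_167  # assumed: ImageNet-1k training-set size
epochs = 5
batch_size = 8

iterations = epochs * images_per_epoch // batch_size
print(iterations)  # 800729 -- roughly 800k, consistent with ">600k"
```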
I believe we’ve had some earlier discussion on this topic @richcmwang.
https://github.com/lucidrains/DALLE-pytorch/issues/74#issuecomment-794331986
A podcast called "DeepLearningDeepDive" was able to get one of the main researchers behind the DALL-E paper in for an interview; they go over the entire paper. The researcher is asked about this very topic in the video:
Youtube Video: https://www.youtube.com/watch?v=PtdpWC7Sr98&t=2544s
Podcast (Apple): https://podcasts.apple.com/us/podcast/deep-learning-deep-dive/id1555309024
Podcast (Spotify): https://open.spotify.com/show/1zqRuymMjxXGKYMmEeTarz
Some have suggested that the researcher was somewhat coy about the equation in "Figure 1" of the paper. Having rewatched the video, I'm fairly certain it's not terribly important. The researcher claims:
First discussion (heavily paraphrased)
Discussing Differences Between VQGAN paper and DALLE’s VAE
Second discussion
(Aside: the whole video is incredibly useful and I highly recommend watching it. The OpenAI team made many decisions based on their specific goals, which aren't necessarily hard requirements for an implementation.)
Other notable quotes (again, paraphrasing)
@janEbert @mehdidc @robvanvolt @rom1504
Discussing the attention architecture: (“row”, “column”, “row”, “row”)
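To make the ("row", "column", "row", "row") pattern concrete, here is a small sketch of assigning an axial-attention type to each transformer layer by cycling through that tuple. The function name and depth are hypothetical, not code from DALLE-pytorch:

```python
# Hypothetical helper: cycle the ("row", "column", "row", "row") pattern
# across the transformer's layers.
def attention_pattern(depth):
    cycle = ("row", "column", "row", "row")
    return [cycle[i % len(cycle)] for i in range(depth)]

print(attention_pattern(8))
# ['row', 'column', 'row', 'row', 'row', 'column', 'row', 'row']
```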
@lucidrains
Sparse Attention for Hardware Constraints? Or for Loss?
Later in the video they recap the subject and intuit that row and column attention indeed help more than dense attention alone would, because they let the transformer take advantage of the two-dimensional nature of images to learn a more efficient representation.
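For intuition, here is a minimal sketch of what "row" attention over a grid of image tokens looks like. This is purely illustrative (single head, no projections, assumed shapes), not the DALL-E or DALLE-pytorch implementation:

```python
import torch

b, h, w, d = 2, 32, 32, 64
tokens = torch.randn(b, h * w, d)  # flattened 32x32 grid of image tokens

# "row" attention: reshape so each grid row is its own sequence, then attend
# within rows only -- cost O(h * w^2) instead of dense O((h*w)^2).
rows = tokens.reshape(b * h, w, d)
attn = torch.softmax(rows @ rows.transpose(1, 2) / d ** 0.5, dim=-1)
out = (attn @ rows).reshape(b, h * w, d)  # back to the flat token sequence

# "column" attention is the same idea with the grid axes swapped:
cols = tokens.reshape(b, h, w, d).transpose(1, 2).reshape(b * w, h, d)
```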