How is the KL loss computed?
Thanks for the great work!
There's one thing that confuses me very much though. In the paper, the KL loss is computed as (Eq. 3)

$$L_{kl} = \log q_\phi(z \mid x_{lin}) - \log p_\theta(z \mid c_{text}, A), \qquad z \sim q_\phi(z \mid x_{lin}) = \mathcal{N}(z;\, \mu_\phi(x_{lin}),\, \sigma_\phi(x_{lin})).$$

In vanilla VAEs, the KL loss is actually an expectation. As the two distributions involved are both Gaussians, there is a closed-form expression. It is understandable that, as the prior $p_\theta(z \mid c)$ is no longer a Gaussian in VITS (it involves a normalizing flow), we don't calculate the expectation in closed form but instead use the sampled $z$ to evaluate the probability densities of $q_\phi(z \mid x)$ and $p_\theta(z \mid c)$ and calculate Eq. 3 as a single-sample estimate. Up to this point, there is no problem for me.
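To make that concrete, here is a minimal sketch (my own illustration, not code from the repo; all names are made up) contrasting the closed-form Gaussian KL used in vanilla VAEs with the single-sample estimate that Eq. 3 describes:

```python
import math
import torch

# Hypothetical toy tensors, just for illustration (not VITS shapes).
m_q, logs_q = torch.randn(8), 0.1 * torch.randn(8)  # posterior mean / log-std
m_p, logs_p = torch.randn(8), 0.1 * torch.randn(8)  # prior mean / log-std

# Vanilla VAE: closed-form KL between two diagonal Gaussians.
kl_closed = (logs_p - logs_q - 0.5
             + 0.5 * (torch.exp(2 * logs_q) + (m_q - m_p) ** 2)
             * torch.exp(-2 * logs_p))

# Eq. 3 style: sample z ~ q, then evaluate log q(z) - log p(z).
# This works even when the prior is not Gaussian (e.g., flow-based).
z = m_q + torch.randn_like(m_q) * torch.exp(logs_q)
log_q = (-0.5 * math.log(2 * math.pi) - logs_q
         - 0.5 * ((z - m_q) ** 2) * torch.exp(-2 * logs_q))
log_p = (-0.5 * math.log(2 * math.pi) - logs_p
         - 0.5 * ((z - m_p) ** 2) * torch.exp(-2 * logs_p))
kl_single_sample = log_q - log_p  # unbiased estimate of kl_closed
```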
Nevertheless, in the code I notice that the KL loss is calculated in a special way, in losses.py: kl_loss(z_p, logs_q, m_p, logs_p, z_mask). In this function, as far as I know, m_p and logs_p are extracted from the text encodings, and logs_q is extracted from the spectrogram posterior encoder. z_p is the flow-transformed latent variable from the posterior encoder. This function calculates the KL loss as the sum of

$$\log\sigma_p - \log\sigma_q - \frac{1}{2} \quad\text{and}\quad \frac{(z_p - \mu_p)^2}{2\sigma_p^2}.$$

So how does this come about? I guess the first term comes from Eq. 4, $p_\theta(z \mid c) = \mathcal{N}(f_\theta(z);\, \mu_\theta(c), \sigma_\theta(c)) \left|\det \frac{\partial f_\theta(z)}{\partial z}\right|$, but why is the log-determinant missing? Also, why is $\mu_q$ not participating in this loss? I really cannot relate this calculation with Eq. 3 in the paper.
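For reference, the computation described above looks roughly like this (a sketch written from my reading of losses.py, not a verbatim copy of the repo):

```python
import torch

def kl_loss_sketch(z_p, logs_q, m_p, logs_p, z_mask):
    # Per-element term: (logs_p - logs_q - 1/2) plus the quadratic
    # term (z_p - m_p)^2 / (2 * sigma_p^2), masked over valid frames.
    kl = logs_p - logs_q - 0.5
    kl = kl + 0.5 * ((z_p - m_p) ** 2) * torch.exp(-2.0 * logs_p)
    return torch.sum(kl * z_mask) / torch.sum(z_mask)
```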
There is another question, by the way. I notice that the mean_only switch is turned on in the ResidualCouplingLayer, which means the log-determinant returned by the flow is always 0. In this case, the transformed distribution is still a Gaussian, right?
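To illustrate why the log-determinant is zero, here is a toy mean-only coupling step (my own illustration, not the actual ResidualCouplingLayer; shift_net stands in for the layer's internal network): since the second half is only shifted, never scaled, the Jacobian diagonal is all ones.

```python
import torch

def mean_only_coupling(x, shift_net):
    # Split channels; transform the second half by a predicted shift only.
    x0, x1 = x.chunk(2, dim=1)
    m = shift_net(x0)                # shift predicted from the first half
    y1 = x1 + m                      # no scaling => Jacobian diagonal is all ones
    logdet = torch.zeros(x.size(0))  # log|det J| = 0: volume-preserving
    return torch.cat([x0, y1], dim=1), logdet

# Example: mean_only_coupling(torch.randn(4, 16, 50), torch.nn.Conv1d(8, 8, 1))
```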
Again, thanks for the work and code!
@cantabile-kwok I think I figured it out. First, about the log-determinant: in Section 2.5.2 of the paper it is said that "we design the normalizing flow to be a volume-preserving transformation with the Jacobian determinant of one". By definition, a volume-preserving transformation is a transformation whose log-determinant equals zero, while a non-volume-preserving transformation has a non-zero log-determinant. In the code it can be seen here; as you rightly said, it's because of the mean_only
flag. Second, about the loss, let's take a look at Eq. 3 and write both log-densities out. The expansion holds since both q and p are normal distributions (and the flow contributes no log-determinant term, as above); the next step holds since z is computed as $z = \mu_q + \epsilon \cdot \sigma_q$ with $\epsilon \sim \mathcal{N}(0, 1)$, so the exponent in the numerator of q becomes $e^{-\epsilon^2/2}$, whose log contributes the constant $-\tfrac{1}{2}$ on average because $\mathbb{E}[\epsilon^2] = 1$ (this is also why $\mu_q$ does not appear in the function: $(z - \mu_q)/\sigma_q$ is just $\epsilon$). Taking the log of the whole expression, we get exactly the kl_loss from the code; the derivation is spelled out below.
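Spelling the steps out (my own rewriting, assuming diagonal Gaussians and a zero log-determinant as above):

$$
\begin{aligned}
\log q_\phi(z \mid x) &= -\tfrac{1}{2}\log 2\pi - \log\sigma_q - \tfrac{1}{2}\epsilon^2,
&& z = \mu_q + \epsilon\,\sigma_q,\ \epsilon \sim \mathcal{N}(0, 1),\\
\log p_\theta(z \mid c) &= -\tfrac{1}{2}\log 2\pi - \log\sigma_p - \frac{(z_p - \mu_p)^2}{2\sigma_p^2},
&& z_p = f_\theta(z),\ \log\lvert\det J_{f_\theta}\rvert = 0,\\
\log q_\phi - \log p_\theta &= \log\sigma_p - \log\sigma_q - \tfrac{1}{2}\epsilon^2
+ \tfrac{1}{2}(z_p - \mu_p)^2 e^{-2\log\sigma_p},\\
\mathbb{E}[\epsilon^2] = 1 \ &\Rightarrow\
\texttt{kl\_loss} = \log\sigma_p - \log\sigma_q - \tfrac{1}{2}
+ \tfrac{1}{2}(z_p - \mu_p)^2 e^{-2\log\sigma_p}.
\end{aligned}
$$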
Hope it was useful and clear, feel free to ask any questions.

@AndreyBocharnikov Yes, I agree with you.