Normalized Cross Entropy and Label Smoothing
Hi, thank you for sharing the implementation!
I was wondering if you could explain the loss computation where you use confidence and label smoothing. I know this is also done in the tensor2tensor repo, but I have a hard time reading and understanding the concept from that code as well. I was reading up on Normalized Cross Entropy here, but it seems like neither the formula you use here nor the one in tensor2tensor really matches the formula discussed in that article. Could you elaborate on the implementation of that formula?
Also, since we take both the correct and incorrect classes into consideration, isn't this different from ordinary cross entropy? With ordinary cross entropy the true class has target probability 1 and the incorrect classes have probability 0, so the loss depends only on the predicted probability of the true class; it doesn't matter how the remaining probability mass is distributed over the incorrect classes. However, with the NCE formula they provide, we have to take the incorrect classes into account as well, right?
import math

import torch
import torch.nn.functional as F


def get_loss(pred, ans, vocab_size, label_smoothing, pad):
    # took this "normalizing" from tensor2tensor. We subtract it for
    # readability. This makes no difference on learning.
    confidence = 1.0 - label_smoothing
    low_confidence = (1.0 - confidence) / float(vocab_size - 1)
    normalizing = -(
        confidence * math.log(confidence) + float(vocab_size - 1) *
        low_confidence * math.log(low_confidence + 1e-20))

    # Smoothed target distribution: `confidence` on the correct token,
    # `low_confidence` spread over the rest of the vocabulary.
    one_hot = torch.zeros_like(pred).scatter_(1, ans.unsqueeze(1), 1)
    one_hot = one_hot * confidence + (1 - one_hot) * low_confidence

    # Cross entropy against the smoothed targets, ignoring padding positions.
    log_prob = F.log_softmax(pred, dim=1)
    xent = -(one_hot * log_prob).sum(dim=1)
    xent = xent.masked_select(ans != pad)
    loss = (xent - normalizing).mean()
    return loss
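For context, this is roughly how I call it (a minimal sketch with made-up shapes and values; I'm assuming from the code that pred holds flattened [num_tokens, vocab_size] logits, ans holds the [num_tokens] target token ids, and index 0 is the padding token):

    vocab_size = 150_000                        # my chatbot vocabulary size
    pad = 0                                     # assuming index 0 is the padding token
    pred = torch.randn(32, vocab_size)          # 32 target positions, random logits
    ans = torch.randint(1, vocab_size, (32,))   # target token ids (none equal to pad)
    loss = get_loss(pred, ans, vocab_size, label_smoothing=0.1, pad=pad)
    print(loss.item())   # with random logits this lands near log(vocab_size) minus normalizing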
I'm training a chatbot with a vocabulary of around 150k words. In the first iterations each log_softmax entry is about -11, and the sum over each sentence position comes out around 3000 when I do xent = -(one_hot * log_prob).sum(dim=1), so the average loss is also around 3000 when I take the mean over all predictions. Does this sound reasonable? It seems like a loss of 3000 is kind of through the roof.
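As a back-of-the-envelope check (using label_smoothing = 0.1 purely as an example value), if the model were totally uncertain, every log_softmax entry would be about -log(150000) ≈ -11.9 and the per-position smoothed cross entropy should only be around 11.9, so I'm not sure where 3000 would come from:

    # Expected loss at initialization, assuming uniform predictions and
    # label_smoothing = 0.1 (an illustrative value, not from the repo).
    import math

    vocab_size = 150_000
    label_smoothing = 0.1
    confidence = 1.0 - label_smoothing
    low_confidence = label_smoothing / (vocab_size - 1)

    uniform_log_prob = -math.log(vocab_size)    # ~ -11.9
    # The smoothed targets sum to 1, so the cross entropy against a uniform
    # prediction is just -log(1/V) regardless of the smoothing constant.
    xent = -(confidence + (vocab_size - 1) * low_confidence) * uniform_log_prob
    normalizing = -(confidence * math.log(confidence)
                    + (vocab_size - 1) * low_confidence
                      * math.log(low_confidence + 1e-20))
    print(xent, xent - normalizing)             # ~11.9 and ~10.4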
Thanks in advance.
Hmm… I'm not sure what I changed, but the loss seems pretty reasonable now; it's down to about ~4.6, so hopefully things are improving. Thank you so much!
What is the range (min/max) of your logit/softmax values over the vocabulary?
You said the loss is around 5 when label_smoothing is zero, and around 3000 otherwise. If all the logits had the same value, each softmax value should be around 1/150000 and log_softmax(logit) would be around -12, so a loss of 3000 does not make sense in that case. So I guess the reason your value is close to 3000 is that some of your logit values are much smaller than the others. Try changing the initialization of the embedding layer and see how it goes. That said, I still think this high loss may not be a big problem as long as the loss converges, or you can also reduce the label smoothing constant.
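In case it's useful, here is a quick way to inspect that range (a hypothetical snippet of my own, assuming pred is the [num_tokens, vocab_size] logit tensor that goes into get_loss):

    # Hypothetical diagnostic for the min/max question above.
    import torch
    import torch.nn.functional as F

    def inspect_logits(pred):
        log_prob = F.log_softmax(pred, dim=1)
        print("logits:      min %.2f  max %.2f" % (pred.min().item(), pred.max().item()))
        print("log_softmax: min %.2f  max %.2f" % (log_prob.min().item(), log_prob.max().item()))

    # With roughly uniform logits, log_softmax values cluster near
    # -log(150000) ~= -11.9, so the smoothed loss cannot reach 3000.
    inspect_logits(torch.randn(8, 150_000))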