Normalized Cross Entropy and Label Smoothing
Hi, thank you for sharing the implementation!
I was wondering if you could explain the loss computation where you use confidence and label smoothing. I know this is also done in the tensor2tensor repo, but I have a hard time reading and understanding the concept from that code as well. I was reading up on Normalized Cross Entropy here, but it seems like neither the formula you use here nor the one in tensor2tensor really matches the formula discussed in that article. Could you elaborate on the implementation of that formula?
Also, since we take both the correct and incorrect classes into consideration, isn't this different from ordinary cross entropy? With ordinary cross entropy the true class has target probability 1 and the incorrect classes have probability 0, so the loss depends only on the predicted probability of the true class; it doesn't matter how the remaining probability mass is distributed over the incorrect classes. However, with the NCE formula they provide, we have to take the incorrect classes into account as well, right?
import math

import torch
import torch.nn.functional as F


def get_loss(pred, ans, vocab_size, label_smoothing, pad):
    # took this "normalizing" from tensor2tensor. We subtract it for
    # readability. This makes no difference on learning.
    confidence = 1.0 - label_smoothing
    low_confidence = (1.0 - confidence) / float(vocab_size - 1)
    normalizing = -(
        confidence * math.log(confidence) + float(vocab_size - 1) *
        low_confidence * math.log(low_confidence + 1e-20))

    # Smoothed target distribution: `confidence` on the correct token,
    # `low_confidence` spread over the rest of the vocabulary.
    one_hot = torch.zeros_like(pred).scatter_(1, ans.unsqueeze(1), 1)
    one_hot = one_hot * confidence + (1 - one_hot) * low_confidence

    # Cross entropy against the smoothed targets, ignoring padding positions.
    log_prob = F.log_softmax(pred, dim=1)
    xent = -(one_hot * log_prob).sum(dim=1)
    xent = xent.masked_select(ans != pad)
    loss = (xent - normalizing).mean()
    return loss
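For context, this is roughly how I call it (a minimal sketch with made-up shapes and values; I'm assuming from the code that pred holds flattened [num_tokens, vocab_size] logits, ans holds the [num_tokens] target token ids, and index 0 is the padding token):

    vocab_size = 150_000                        # my chatbot vocabulary size
    pad = 0                                     # assuming index 0 is the padding token
    pred = torch.randn(32, vocab_size)          # 32 target positions, random logits
    ans = torch.randint(1, vocab_size, (32,))   # target token ids (none equal to pad)
    loss = get_loss(pred, ans, vocab_size, label_smoothing=0.1, pad=pad)
    print(loss.item())   # with random logits this lands near log(vocab_size) minus normalizing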
I'm training a chatbot with a vocabulary of around 150k words. In the first iterations each log_softmax entry is about -11, and the sum over each sentence position comes out around 3000 when I do xent = -(one_hot * log_prob).sum(dim=1), so the average loss is also around 3000 when I take the mean over all predictions. Does this sound reasonable? It seems like a loss of 3000 is kind of through the roof.
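As a back-of-the-envelope check (using label_smoothing = 0.1 purely as an example value), if the model were totally uncertain, every log_softmax entry would be about -log(150000) ≈ -11.9 and the per-position smoothed cross entropy should only be around 11.9, so I'm not sure where 3000 would come from:

    # Expected loss at initialization, assuming uniform predictions and
    # label_smoothing = 0.1 (an illustrative value, not from the repo).
    import math

    vocab_size = 150_000
    label_smoothing = 0.1
    confidence = 1.0 - label_smoothing
    low_confidence = label_smoothing / (vocab_size - 1)

    uniform_log_prob = -math.log(vocab_size)    # ~ -11.9
    # The smoothed targets sum to 1, so the cross entropy against a uniform
    # prediction is just -log(1/V) regardless of the smoothing constant.
    xent = -(confidence + (vocab_size - 1) * low_confidence) * uniform_log_prob
    normalizing = -(confidence * math.log(confidence)
                    + (vocab_size - 1) * low_confidence
                      * math.log(low_confidence + 1e-20))
    print(xent, xent - normalizing)             # ~11.9 and ~10.4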
Thanks in advance.
Hmm… I'm not sure what I changed, but the loss seems pretty reasonable now; it's down to about ~4.6, so hopefully things are improving. Thank you so much!
What is the range (min/max) of your logit/softmax values over the vocabulary?
You said the loss is around 5 when label_smoothing is zero, and around 3000 otherwise. If all the logits had the same value, each softmax value should be around 1/150000 and log_softmax(logit) would be around -12, so a loss of 3000 does not make sense in that case. So I guess the reason your value is close to 3000 is that some of your logit values are much smaller than the others. Try changing the initialization of the embedding layer and see how it goes. That said, I still think this high loss may not be a big problem as long as the loss converges, or you can also reduce the label smoothing constant.
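In case it's useful, here is a quick way to inspect that range (a hypothetical snippet of my own, assuming pred is the [num_tokens, vocab_size] logit tensor that goes into get_loss):

    # Hypothetical diagnostic for the min/max question above.
    import torch
    import torch.nn.functional as F

    def inspect_logits(pred):
        log_prob = F.log_softmax(pred, dim=1)
        print("logits:      min %.2f  max %.2f" % (pred.min().item(), pred.max().item()))
        print("log_softmax: min %.2f  max %.2f" % (log_prob.min().item(), log_prob.max().item()))

    # With roughly uniform logits, log_softmax values cluster near
    # -log(150000) ~= -11.9, so the smoothed loss cannot reach 3000.
    inspect_logits(torch.randn(8, 150_000))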