Some questions in Loss Function for MaskedLM
Use the same sentence as in your Usage section:
# Load the pre-trained tokenizer (vocabulary)
from pytorch_pretrained_bert import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenized input
text = "Who was Jim Henson ? Jim Henson was a puppeteer"
tokenized_text = tokenizer.tokenize(text)

# Mask a token that we will try to predict back with `BertForMaskedLM`
masked_index = 6
tokenized_text[masked_index] = '[MASK]'
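
For reference, a minimal sketch of predicting the masked token back, along the lines of the Usage section (assuming the bert-base-uncased weights):

import torch
from pytorch_pretrained_bert import BertForMaskedLM

# Convert tokens to vocabulary indices and wrap them in a batch of size 1
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
tokens_tensor = torch.tensor([indexed_tokens])

# Load the pre-trained model; with no masked_lm_labels it returns the prediction scores
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()
with torch.no_grad():
    predictions = model(tokens_tensor)  # shape (batch, seq_len, vocab_size)
predicted_index = torch.argmax(predictions[0, masked_index]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]  # expected: 'henson'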
Q1. When we use this sentence as training data, according to your code:
if masked_lm_labels is not None:
    loss_fct = CrossEntropyLoss(ignore_index=-1)
    masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), masked_lm_labels.view(-1))
    return masked_lm_loss
it seems the loss is a sum over all the words in this sentence, not just the single masked word “henson”. Am I right? In my opinion, we only need to calculate the masked word’s loss, not the whole sentence’s.
Q2. I also have a question about masking. The paper says it “chooses 15% of tokens at random”, and I am not sure how to understand that: does each token have a 15% probability of being masked, or is exactly 15% of the sentence masked? Hope you can help me clarify this.
By the way, the comment on line 731 of pytorch-pretrained-BERT/pytorch_pretrained_bert/modeling.py should read: if masked_lm_labels is not None; it is missing the word “not”.

It’s exactly what is done in the current implementation. The labels of the tokens that are not masked are set to -1, and the loss function ignores those tokens by setting ignore_index=-1 (see the CrossEntropyLoss documentation).
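
To make that concrete, here is a small self-contained sketch with toy tensors (not the actual model outputs) showing that ignore_index=-1 restricts the loss to the masked positions:

import torch
from torch.nn import CrossEntropyLoss

# Toy setup: 5 tokens, vocabulary of size 10, only position 2 is masked
vocab_size = 10
prediction_scores = torch.randn(1, 5, vocab_size)       # (batch, seq_len, vocab_size)
masked_lm_labels = torch.tensor([[-1, -1, 7, -1, -1]])  # -1 = not masked, 7 = id of the original token

loss_fct = CrossEntropyLoss(ignore_index=-1)
loss = loss_fct(prediction_scores.view(-1, vocab_size), masked_lm_labels.view(-1))

# Identical to scoring only the masked position: the -1 labels are dropped from
# both the summed loss and the averaging denominator
loss_masked_only = loss_fct(prediction_scores[0, 2:3], masked_lm_labels[0, 2:3])
assert torch.allclose(loss, loss_masked_only)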
Each token has a probability of 15% of getting masked. You might want to check out this code to get a better understanding.
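
For illustration, a stripped-down sketch of that per-token sampling (a hypothetical helper; it leaves out the paper's additional 80%/10%/10% mask/random/keep replacement rules):

import random

def mask_tokens(tokens, mask_prob=0.15):
    # Each token is drawn independently, so a short sentence may end up with no
    # masked token at all and a long one with more than 15% of its tokens masked
    output_tokens = list(tokens)
    labels = [None] * len(tokens)        # None = not a prediction target (the real code uses -1 ids)
    for i, token in enumerate(tokens):
        if random.random() < mask_prob:  # 15% chance, decided per token
            labels[i] = token            # remember the original token as the target
            output_tokens[i] = '[MASK]'
    return output_tokens, labels

masked, labels = mask_tokens("who was jim henson ? jim henson was a puppeteer".split())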
Thanks @tholor!