On masked-lm labels and computing the loss
Recently I was using BERT for my own project, and going through the function `mask_tokens` I found this line of code:
```python
labels[~masked_indices] = -100  # We only compute loss on masked tokens
```
I wonder why we do this. I get the part where we do

```python
indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices
inputs[indices_replaced] = tokenizer.convert_tokens_to_ids(tokenizer.mask_token)
```

to mask the input tokens, but is it necessary for the labels? If I had a constant -100 as the ground truth while the actual id was, say, 1000, the loss might never converge.
And I’ve found two contradictory comments, i.e.
```python
labels[~masked_indices] = -100  # We only compute loss on masked tokens
```
(from `run_language_modeling`)

and

```
masked_lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):
    Labels for computing the masked language modeling loss.
    Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring)
    Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels
    in ``[0, ..., config.vocab_size]``
```
(from `modeling_bert`)
One says the loss is computed on the masked tokens, and the other says they are ignored… Could anyone please clarify this? Thanks.
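Worth noting for the question above: -100 is the default `ignore_index` of PyTorch's `nn.CrossEntropyLoss`, which the masked-LM head uses to compute its loss, so positions whose label is -100 are simply skipped; the docstring's "ignored (masked)" means masked *out of the loss*, not the `[MASK]` input token. A minimal sketch with toy tensors (plain PyTorch; the shapes and values below are made up purely for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, seq_len = 10, 5
logits = torch.randn(seq_len, vocab_size)          # fake per-token scores
labels = torch.randint(0, vocab_size, (seq_len,))  # fake ground-truth token ids

masked_indices = torch.tensor([False, True, False, True, False])
labels_for_loss = labels.clone()
labels_for_loss[~masked_indices] = -100            # non-masked positions get the sentinel

loss_fct = nn.CrossEntropyLoss()                   # ignore_index defaults to -100
loss = loss_fct(logits, labels_for_loss)

# Identical to averaging cross-entropy over the masked positions only:
manual = nn.CrossEntropyLoss()(logits[masked_indices], labels[masked_indices])
print(loss.item(), manual.item())                  # same value
```

Because the default reduction is a mean over the non-ignored positions, adding or omitting the -100 fill also changes the reported loss value, not just which tokens contribute.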
Yes, I was wondering why this is missing as well. There doesn’t seem to be any documentation indicating that this is happening automatically before the loss is computed. And, based on some limited testing on my end, I get different values for the loss when I do this.
@LysandreJik Isn’t the example in the official documentation missing the following line of code before feeding the labels into the model?

```python
labels[inputs.input_ids != tokenizer.mask_token_id] = -100
```

I believe with this we calculate the negative log-likelihood just for the masked token, which is `Paris` in the given example.
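A sketch of what that corrected snippet might look like (assumptions: `bert-base-uncased`, the docs' Paris sentence reconstructed here from memory as "The capital of France is [MASK].", and a transformers version where the loss argument is named `labels` rather than the older `masked_lm_labels` and the output exposes `.loss`):

```python
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
labels = tokenizer("The capital of France is Paris.", return_tensors="pt")["input_ids"]

# Loss over every position (what the example does without the extra line):
loss_all = model(**inputs, labels=labels).loss

# Loss over the [MASK] position only, after filling the rest with -100:
labels_masked_only = labels.clone()
labels_masked_only[inputs.input_ids != tokenizer.mask_token_id] = -100
loss_masked = model(**inputs, labels=labels_masked_only).loss

print(loss_all.item(), loss_masked.item())  # the two values differ
```

Printing both values also reproduces the difference mentioned above: with the -100 fill, only the `[MASK]` position contributes to the negative log-likelihood.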