
Some questions about the loss function for MaskedLM

See original GitHub issue

Using the same sentence as in your Usage section:

from pytorch_pretrained_bert import BertTokenizer

# Load the pre-trained BERT tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenized input
text = "Who was Jim Henson ? Jim Henson was a puppeteer"
tokenized_text = tokenizer.tokenize(text)

# Mask a token that we will try to predict back with `BertForMaskedLM`
masked_index = 6
tokenized_text[masked_index] = '[MASK]'
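
For context, the repository's README then feeds this masked sentence to `BertForMaskedLM` to predict the token back. Below is a minimal sketch roughly along those lines; the model name and the all-zero segment ids are assumptions for a single-sentence input, not part of the original issue.

import torch
from pytorch_pretrained_bert import BertForMaskedLM

# Convert tokens to vocabulary ids and build a single-sentence batch
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
segments_ids = [0] * len(tokenized_text)   # one sentence -> all segment 0
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])

# With no masked_lm_labels, the model returns prediction scores, not a loss
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()
predictions = model(tokens_tensor, segments_tensors)
predicted_index = torch.argmax(predictions[0, masked_index]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]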

Q1. When we use this sentence as training data, then according to your code

if masked_lm_labels is not None:
    loss_fct = CrossEntropyLoss(ignore_index=-1)
    masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), masked_lm_labels.view(-1))
    return masked_lm_loss

it seems the loss is summed over all words in the sentence, not just the single word “henson”. Am I right? In my opinion, we only need to calculate the masked word’s loss, not the whole sentence’s.

Q2. This is also a question about masking. The paper says it “chooses 15% of tokens at random”, and I don’t know how to interpret that: does each word have a 15% probability of being masked, or is just 15% of the sentence masked? I hope you can help me clear this up.

By the way, the comment on line 731 of pytorch-pretrained-BERT/pytorch_pretrained_bert/modeling.py should read `if masked_lm_labels is not None`; it is missing the word “not”.

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

2 reactions
tholor commented, Dec 28, 2018

Q1: […] In my opinion, we only need to calculate the masked word’s loss, not the whole sentence’s.

That is exactly what the current implementation does. The labels of the non-masked tokens are set to -1, and the loss function ignores those positions because it is constructed with ignore_index=-1 (see the CrossEntropyLoss documentation).
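
To make that concrete, here is a small, self-contained sketch (with random dummy scores and an arbitrary placeholder label id, not a real training example) of how the -1 labels combine with ignore_index=-1 so that only the masked position contributes to the loss:

import torch
from torch.nn import CrossEntropyLoss

vocab_size = 30522      # bert-base-uncased vocabulary size
seq_len = 10            # length of the tokenized example above
masked_index = 6        # position that was replaced by [MASK]

# Dummy prediction scores standing in for the model output (batch of 1)
prediction_scores = torch.randn(1, seq_len, vocab_size)

# Labels: -1 everywhere except the masked position, which holds the
# original token's vocabulary id (an arbitrary placeholder id here)
masked_lm_labels = torch.full((1, seq_len), -1, dtype=torch.long)
masked_lm_labels[0, masked_index] = 4321

loss_fct = CrossEntropyLoss(ignore_index=-1)
masked_lm_loss = loss_fct(prediction_scores.view(-1, vocab_size),
                          masked_lm_labels.view(-1))
# Only position 6 contributes; every position labelled -1 is ignored.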

Q2. This is also a question about masking. The paper says it “chooses 15% of tokens at random” […] does each word have a 15% probability of being masked, or is just 15% of the sentence masked?

Each token has a 15% probability of being masked. You might want to check out this code to get a better understanding.
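
For reference, here is a simplified sketch of that per-token rule, written over a plain Python token list rather than the repository's actual data pipeline; the 80/10/10 replacement split is the one described in the paper.

import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Independently select each token with probability 15% for prediction.

    Selected tokens: 80% become [MASK], 10% become a random token, and 10%
    stay unchanged. Non-selected positions get label -1 so the loss ignores
    them. (Real code would store vocabulary ids as labels, not token strings.)
    """
    tokens = list(tokens)
    labels = []
    for i, token in enumerate(tokens):
        if random.random() < mask_prob:
            labels.append(token)        # predict the original token here
            dice = random.random()
            if dice < 0.8:
                tokens[i] = '[MASK]'
            elif dice < 0.9:
                tokens[i] = random.choice(vocab)
            # else: keep the original token unchanged
        else:
            labels.append(-1)           # ignored by the loss
    return tokens, labels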

0 reactions
thomwolf commented, Jan 7, 2019

Thanks @tholor!

