Some questions in Loss Function for MaskedLM
Use the same sentence as in your Usage section:
# Load the pre-trained tokenizer (vocabulary)
from pytorch_pretrained_bert import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenized input
text = "Who was Jim Henson ? Jim Henson was a puppeteer"
tokenized_text = tokenizer.tokenize(text)

# Mask a token that we will try to predict back with `BertForMaskedLM`
masked_index = 6
tokenized_text[masked_index] = '[MASK]'
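
For reference, a minimal sketch of predicting the masked token back, along the lines of the Usage section (assuming the bert-base-uncased weights):

import torch
from pytorch_pretrained_bert import BertForMaskedLM

# Convert tokens to vocabulary indices and wrap them in a batch of size 1
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
tokens_tensor = torch.tensor([indexed_tokens])

# Load the pre-trained model; with no masked_lm_labels it returns the prediction scores
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()
with torch.no_grad():
    predictions = model(tokens_tensor)  # shape (batch, seq_len, vocab_size)
predicted_index = torch.argmax(predictions[0, masked_index]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]  # expected: 'henson'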
Q1. When we use this sentence as training data, according to your code:
if masked_lm_labels is not None:
    loss_fct = CrossEntropyLoss(ignore_index=-1)
    masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), masked_lm_labels.view(-1))
    return masked_lm_loss
it seems the loss is a sum over all the words in this sentence, not just the single masked word “henson”. Am I right? In my opinion, we only need to calculate the masked word’s loss, not the whole sentence’s.
Q2. I also have a question about masking. The paper says it “chooses 15% of tokens at random”, and I am not sure how to understand that: does each token have a 15% probability of being masked, or is exactly 15% of the sentence masked? Hope you can help me clarify this.
By the way, the comment on line 731 of pytorch-pretrained-BERT/pytorch_pretrained_bert/modeling.py should read: if masked_lm_labels is not None; it is missing the word “not”.

It’s exactly what is done in the current implementation. The labels of the tokens that are not masked are set to -1, and the loss function ignores those tokens by setting ignore_index=-1 (see the CrossEntropyLoss documentation).
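
To make that concrete, here is a small self-contained sketch with toy tensors (not the actual model outputs) showing that ignore_index=-1 restricts the loss to the masked positions:

import torch
from torch.nn import CrossEntropyLoss

# Toy setup: 5 tokens, vocabulary of size 10, only position 2 is masked
vocab_size = 10
prediction_scores = torch.randn(1, 5, vocab_size)       # (batch, seq_len, vocab_size)
masked_lm_labels = torch.tensor([[-1, -1, 7, -1, -1]])  # -1 = not masked, 7 = id of the original token

loss_fct = CrossEntropyLoss(ignore_index=-1)
loss = loss_fct(prediction_scores.view(-1, vocab_size), masked_lm_labels.view(-1))

# Identical to scoring only the masked position: the -1 labels are dropped from
# both the summed loss and the averaging denominator
loss_masked_only = loss_fct(prediction_scores[0, 2:3], masked_lm_labels[0, 2:3])
assert torch.allclose(loss, loss_masked_only)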
Each token has a probability of 15% of getting masked. You might want to check out this code to get a better understanding.
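
For illustration, a stripped-down sketch of that per-token sampling (a hypothetical helper; it leaves out the paper's additional 80%/10%/10% mask/random/keep replacement rules):

import random

def mask_tokens(tokens, mask_prob=0.15):
    # Each token is drawn independently, so a short sentence may end up with no
    # masked token at all and a long one with more than 15% of its tokens masked
    output_tokens = list(tokens)
    labels = [None] * len(tokens)        # None = not a prediction target (the real code uses -1 ids)
    for i, token in enumerate(tokens):
        if random.random() < mask_prob:  # 15% chance, decided per token
            labels[i] = token            # remember the original token as the target
            output_tokens[i] = '[MASK]'
    return output_tokens, labels

masked, labels = mask_tokens("who was jim henson ? jim henson was a puppeteer".split())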
Thanks @tholor!