
When training the masked LM, are the unmasked words (which have label 0) trained together with the masked words?

See original GitHub issue

According to the code:

    def random_word(self, sentence):
        tokens = sentence.split()
        output_label = []

        for i, token in enumerate(tokens):
            prob = random.random()
            if prob < 0.15:
                # rescale prob so the branches below split this 15% into 80/10/10
                prob /= 0.15

                # 80%: replace the token with the mask token
                if prob < 0.8:
                    tokens[i] = self.vocab.mask_index

                # 10%: replace the token with a random token
                elif prob < 0.9:
                    tokens[i] = random.randrange(len(self.vocab))

                # 10%: keep the current token (as its vocab id)
                else:
                    tokens[i] = self.vocab.stoi.get(token, self.vocab.unk_index)

                output_label.append(self.vocab.stoi.get(token, self.vocab.unk_index))

            else:
                tokens[i] = self.vocab.stoi.get(token, self.vocab.unk_index)
                output_label.append(0)

        return tokens, output_label

Do we need to exclude the unmasked words when training the LM?
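For context, one common way to make the loss skip those label-0 positions in PyTorch is to pass the reserved label as `ignore_index` to the criterion. The sketch below is illustrative only: it assumes a masked-LM head that outputs per-token logits of shape `(batch, seq_len, vocab_size)` and labels built like `output_label` above (0 at non-masked positions); the tensor names and values are hypothetical, not taken from this repository.

    import torch
    import torch.nn as nn

    batch, seq_len, vocab_size = 2, 8, 1000

    # Hypothetical per-token logits from a masked-LM head: (batch, seq_len, vocab_size)
    logits = torch.randn(batch, seq_len, vocab_size)

    # Labels built like output_label above: vocab ids at masked positions, 0 elsewhere
    labels = torch.zeros(batch, seq_len, dtype=torch.long)
    labels[0, 2] = 57    # one masked position in the first sequence
    labels[1, 5] = 913   # one masked position in the second sequence

    # ignore_index=0 drops every position whose label is 0 from the loss,
    # so only the masked tokens are actually trained on
    criterion = nn.CrossEntropyLoss(ignore_index=0)
    loss = criterion(logits.view(-1, vocab_size), labels.view(-1))
    print(loss.item())

`nn.NLLLoss(ignore_index=0)` gives the same effect when the model already outputs log-probabilities.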

Issue Analytics

  • State: open
  • Created: 5 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

1 reaction
codertimo commented, Oct 23, 2018

@coddinglxf that’s what I thought at first, but I couldn’t find a way to implement it efficiently in terms of GPU computation time. If you have an idea, please implement it and open a pull request 😃 It would be really cool to do 👍
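On the efficiency point: excluding the non-masked positions can be done fully vectorized, so on the GPU it costs little more than the loss computation itself. A rough sketch of one such approach, reusing the hypothetical shapes from the sketch above (per-token losses are masked and averaged over the masked positions only; all names are illustrative):

    import torch
    import torch.nn.functional as F

    vocab_size = 1000
    logits = torch.randn(2, 8, vocab_size)        # (batch, seq_len, vocab_size)
    labels = torch.zeros(2, 8, dtype=torch.long)  # 0 marks non-masked positions
    labels[0, 2], labels[1, 5] = 57, 913

    # Per-token cross-entropy without reduction, then zero out non-masked positions
    per_token = F.cross_entropy(
        logits.view(-1, vocab_size), labels.view(-1), reduction="none"
    ).view(labels.shape)
    mask = (labels != 0).float()
    loss = (per_token * mask).sum() / mask.sum().clamp(min=1)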

0 reactions
codertimo commented, Oct 30, 2018

@leon-cas yes, your question is solved in #36.

Read more comments on GitHub >

