
How to ignore PAD tokens for NER

See original GitHub issue

Hi,

Thank you for such a great repo. I am trying to use the word/token embeddings from a pretrained transformer for NER. The following code is a snippet of my model; for simplicity I am using a linear decoder rather than a CRF decoder.

import torch.nn as nn
from transformers import BertModel, BertTokenizer

# model_dir, config and tag2idx are defined elsewhere in my code
model_bert = BertModel.from_pretrained(model_dir, config=config)
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

class BERTNER(nn.Module):
    def __init__(self, model, hidden_dim, num_labels):
        """
        Torch model that uses BERT and adds a classifier at the end.
        num_labels is the number of entity labels.
        """
        super(BERTNER, self).__init__()
        self.model = model
        self.hidden_dim = hidden_dim
        self.num_labels = num_labels
        self.rnn = nn.LSTM(self.model.config.hidden_size, hidden_dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_labels)

    def forward(self, input_ids, attention_mask):
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
        sequence_output = outputs[0]  # (batch_size, seq_len, hidden_size)

        out, _ = self.rnn(sequence_output)
        return self.classifier(out)  # (batch_size, seq_len, num_labels)

model = BERTNER(model_bert, 128, len(tag2idx))

And this is the part where I am confused. My inputs to the model are all padded to a fixed length. Generally, when sentences are padded and one uses nn.Embedding, the padding can be ignored via padding_idx (https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html). But here it is not clear to me how to ignore the padded tokens. Any help will be greatly appreciated. Thanks in advance.
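
For reference, here is a minimal sketch of the nn.Embedding behaviour I mean (the vocabulary size, dimensions and ids are made up):

import torch
import torch.nn as nn

# padding_idx=0 gives token id 0 a fixed all-zero vector that never receives
# gradient updates, so padded positions are effectively ignored by the embedding.
embedding = nn.Embedding(num_embeddings=10000, embedding_dim=128, padding_idx=0)

ids = torch.tensor([[5, 42, 7, 0, 0]])  # a sequence padded with id 0
vectors = embedding(ids)
print(vectors[0, 3])  # all zeros at a padded position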

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 5 (1 by maintainers)

Top GitHub Comments

3 reactions
NielsRogge commented, Jul 23, 2021

First, placing an LSTM on top of the final hidden states of a model like BERT is not needed. You can just place a linear layer on top. Any xxxForTokenClassification model in the library is implemented that way, and it works really well.
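
For example, a minimal sketch along those lines with BertForTokenClassification (the checkpoint name, sentence and label count below are just placeholders):

from transformers import BertTokenizerFast, BertForTokenClassification

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
model = BertForTokenClassification.from_pretrained("bert-base-cased", num_labels=9)

# Pad to a fixed length; attention_mask marks real tokens (1) vs [PAD] (0).
encoding = tokenizer(["My name is Wolfgang and I live in Berlin"],
                     padding="max_length", max_length=32, return_tensors="pt")

outputs = model(**encoding)
print(outputs.logits.shape)  # (batch_size, seq_len, num_labels)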

Second, to ignore padding tokens, you should make predictions for all tokens, but simply label pad tokens with -100, as this is the default ignore_index of the CrossEntropyLoss in PyTorch. This means that they will not be taken into account by the loss function.
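
Concretely, a small sketch of how -100 interacts with the loss (the logits and labels below are made up for illustration):

import torch
import torch.nn as nn

loss_fct = nn.CrossEntropyLoss()  # ignore_index defaults to -100

# One sentence with 4 token positions and 3 possible labels.
logits = torch.randn(1, 4, 3)
# The last two positions are padding: labelling them -100 means they contribute
# nothing to the loss or to its gradients.
labels = torch.tensor([[2, 0, -100, -100]])

loss = loss_fct(logits.view(-1, 3), labels.view(-1))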

Btw, I do have an example notebook for NER, which you can find here. There’s also the official one, which you can find here.

2 reactions
david-waterworth commented, Jul 23, 2021

The attention_mask indicates whether a token is padding or an actual token. The usual way to deal with padding in an LSTM is to pass the length of each sequence, which you can work out by summing the attention_mask along the “time” axis, i.e. something like

sequence_lengths = torch.sum(attention_mask, dim=1).cpu()  # number of real (non-pad) tokens per sequence

# batch_first=True to match the LSTM defined above
packed_sequence = nn.utils.rnn.pack_padded_sequence(sequence_output, sequence_lengths, batch_first=True)
outputs, hidden = self.rnn(packed_sequence)
outputs, _ = nn.utils.rnn.pad_packed_sequence(outputs, batch_first=True)

You’ll have to double check the axis you want to sum over, and that attention_mask=1 for non-padded tokens (otherwise you’ll have to negate it) but hopefully this will help.
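
Folding that into the forward pass of the model above might look roughly like this (a sketch, assuming batch_first=True throughout and that the labels keep the padded length, hence total_length):

import torch.nn as nn

def forward(self, input_ids, attention_mask):
    sequence_output = self.model(input_ids=input_ids, attention_mask=attention_mask)[0]

    # Lengths of the real (non-padded) parts; pack_padded_sequence wants them on the CPU.
    sequence_lengths = attention_mask.sum(dim=1).cpu()

    packed = nn.utils.rnn.pack_padded_sequence(
        sequence_output, sequence_lengths, batch_first=True, enforce_sorted=False)
    packed_out, _ = self.rnn(packed)
    # total_length pads the output back up to the full sequence length so the
    # classifier output lines up position-for-position with the padded labels.
    out, _ = nn.utils.rnn.pad_packed_sequence(
        packed_out, batch_first=True, total_length=input_ids.size(1))

    return self.classifier(out)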

