
How to ignore PAD tokens for NER

See original GitHub issue

Hi,

Thank you for such a great repo. I am trying to use the word/token embeddings from a pretrained transformer for NER. The following code is a snippet of my model; for simplicity I am using a linear decoder rather than a CRF decoder.

import torch.nn as nn
from transformers import BertModel, BertTokenizer

# model_dir, config and tag2idx are defined elsewhere in my code
model_bert = BertModel.from_pretrained(model_dir, config=config)
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

class BERTNER(nn.Module):
    def __init__(self, model, hidden_dim, num_labels):
        """
        Torch model that uses BERT and adds a classifier at the end.
        num_labels is the number of entity labels.
        """
        super(BERTNER, self).__init__()
        self.model = model
        self.hidden_dim = hidden_dim
        self.num_labels = num_labels
        self.rnn = nn.LSTM(self.model.config.hidden_size, hidden_dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_labels)

    def forward(self, input_ids, attention_mask):
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
        sequence_output = outputs[0]  # (batch_size, seq_len, hidden_size)

        out, _ = self.rnn(sequence_output)
        return self.classifier(out)  # (batch_size, seq_len, num_labels)

model = BERTNER(model_bert, 128, len(tag2idx))

And this is the part where I am confused. My inputs to the model are all padded to a fixed length. Generally, when sentences are padded and one uses nn.Embedding, the padding can be ignored via padding_idx (https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html). But here it is not clear to me how to ignore the padded tokens. Any help will be greatly appreciated. Thanks in advance.
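
For reference, here is a minimal sketch of the nn.Embedding behaviour I mean (the vocabulary size, dimensions and ids are made up):

import torch
import torch.nn as nn

# padding_idx=0 gives token id 0 a fixed all-zero vector that never receives
# gradient updates, so padded positions are effectively ignored by the embedding.
embedding = nn.Embedding(num_embeddings=10000, embedding_dim=128, padding_idx=0)

ids = torch.tensor([[5, 42, 7, 0, 0]])  # a sequence padded with id 0
vectors = embedding(ids)
print(vectors[0, 3])  # all zeros at a padded position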

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 5 (1 by maintainers)

Top GitHub Comments

3 reactions
NielsRogge commented, Jul 23, 2021

First, placing an LSTM on top of the final hidden states of a model like BERT is not needed. You can just place a linear layer on top. Any xxxForTokenClassification model in the library is implemented that way, and it works really well.
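
For example, a minimal sketch along those lines with BertForTokenClassification (the checkpoint name, sentence and label count below are just placeholders):

from transformers import BertTokenizerFast, BertForTokenClassification

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
model = BertForTokenClassification.from_pretrained("bert-base-cased", num_labels=9)

# Pad to a fixed length; attention_mask marks real tokens (1) vs [PAD] (0).
encoding = tokenizer(["My name is Wolfgang and I live in Berlin"],
                     padding="max_length", max_length=32, return_tensors="pt")

outputs = model(**encoding)
print(outputs.logits.shape)  # (batch_size, seq_len, num_labels)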

Second, to ignore padding tokens, you should make predictions for all tokens, but simply label pad tokens with -100, as this is the default ignore_index of the CrossEntropyLoss in PyTorch. This means that they will not be taken into account by the loss function.
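
Concretely, a small sketch of how -100 interacts with the loss (the logits and labels below are made up for illustration):

import torch
import torch.nn as nn

loss_fct = nn.CrossEntropyLoss()  # ignore_index defaults to -100

# One sentence with 4 token positions and 3 possible labels.
logits = torch.randn(1, 4, 3)
# The last two positions are padding: labelling them -100 means they contribute
# nothing to the loss or to its gradients.
labels = torch.tensor([[2, 0, -100, -100]])

loss = loss_fct(logits.view(-1, 3), labels.view(-1))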

Btw, I do have an example notebook for NER, which you can find here. There’s also the official one, which you can find here.

2 reactions
david-waterworth commented, Jul 23, 2021

The attention_mask indicates whether a token is padding or an actual token. The usual way to deal with padding in an LSTM is to pass the length of each sequence, which you can work out by summing the attention_mask along the “time” axis, i.e. something like

sequence_lengths = torch.sum(attention_mask, dim=1).cpu()  # number of real (non-pad) tokens per sequence

# batch_first=True to match the LSTM defined above
packed_sequence = nn.utils.rnn.pack_padded_sequence(sequence_output, sequence_lengths, batch_first=True)
outputs, hidden = self.rnn(packed_sequence)
outputs, _ = nn.utils.rnn.pad_packed_sequence(outputs, batch_first=True)

You’ll have to double check the axis you want to sum over, and that attention_mask=1 for non-padded tokens (otherwise you’ll have to negate it) but hopefully this will help.
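
Folding that into the forward pass of the model above might look roughly like this (a sketch, assuming batch_first=True throughout and that the labels keep the padded length, hence total_length):

import torch.nn as nn

def forward(self, input_ids, attention_mask):
    sequence_output = self.model(input_ids=input_ids, attention_mask=attention_mask)[0]

    # Lengths of the real (non-padded) parts; pack_padded_sequence wants them on the CPU.
    sequence_lengths = attention_mask.sum(dim=1).cpu()

    packed = nn.utils.rnn.pack_padded_sequence(
        sequence_output, sequence_lengths, batch_first=True, enforce_sorted=False)
    packed_out, _ = self.rnn(packed)
    # total_length pads the output back up to the full sequence length so the
    # classifier output lines up position-for-position with the padded labels.
    out, _ = nn.utils.rnn.pad_packed_sequence(
        packed_out, batch_first=True, total_length=input_ids.size(1))

    return self.classifier(out)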

