encode_plus not returning attention_mask and not padding
See original GitHub issue

🐛 Bug
Tested with RoBERTa and BERT on the master branch, the encode_plus method of the tokenizer does not return an attention mask. The documentation states that an attention_mask is returned by default, but I only get back the input_ids and the token_type_ids. Even when explicitly specifying return_attention_mask=True, I don't get it back.
If these specific tokenizers (RoBERTa/BERT) don't support this functionality (which would seem odd), it might be useful to mention that in the documentation as well.
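To be explicit about what I expect from the documented behaviour, a minimal call like the one below (just a sketch, not my actual code) should already include an attention_mask key:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Encode a single sentence pair; per the docs, attention_mask should be returned by default
encoded = tokenizer.encode_plus('I like bananas.', 'Do you?')
print(encoded.keys())  # expected to contain: input_ids, token_type_ids, attention_mask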
As a small note, there's also a typo in the documentation:
return_attention_mask – (optional) Set to False to avoir returning attention mask (default True)
Finally, it seems that pad_to_max_length isn't padding my input (see the example below). I also tried passing True instead of an integer, hoping that it would automatically pad up to the maximum sequence length in the batch, but to no avail.
from transformers import BertTokenizer

if __name__ == '__main__':
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    orig_text = ['I like bananas.', 'Yesterday the mailman came by!', 'Do you enjoy cookies?']
    edit_text = ['Do you?', 'He delivered a mystery package.', 'My grandma just baked some!']

    # orig_sents and edit_sents are individual sentences taken pairwise from the two lists
    for orig_sents, edit_sents in zip(orig_text, edit_text):
        orig_tokens = tokenizer.tokenize(orig_sents)
        edit_tokens = tokenizer.tokenize(edit_sents)
        # Note: an integer is passed to pad_to_max_length here; True was also tried (see above)
        seqs = tokenizer.encode_plus(orig_tokens,
                                     edit_tokens,
                                     return_attention_mask=True,
                                     return_tensors='pt',
                                     pad_to_max_length=120)
        print(seqs)
Output:
{'input_ids': tensor([[ 101, 1045, 2066, 26191, 1012, 102, 2079, 2017, 1029, 102]]),
'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 1, 1, 1, 1]])}
{'input_ids': tensor([[ 101, 7483, 1996, 5653, 2386, 2234, 2011, 999, 102, 2002, 5359, 1037, 6547, 7427, 1012, 102]]),
'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]])}
{'input_ids': tensor([[ 101, 2079, 2017, 5959, 16324, 1029, 102, 2026, 13055, 2074, 17776, 2070, 999, 102]]),
'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]])}
Hey! For me, setting pad_to_max_length results in an error being thrown. I just tried it with the master branch, but it resulted in the same error. The code I'm executing:

The error that I am getting:
Aha, great. I couldn't wait because I needed it for a shared task, but it's nice to see it's taking form. Almost there!