encode_plus not returning attention_mask and not padding

See original GitHub issue

šŸ› Bug

Tested on RoBERTa and BERT from the master branch: the tokenizer's encode_plus method does not return an attention mask. The documentation states that an attention_mask is returned by default, but I only get back the input_ids and the token_type_ids. Even when explicitly specifying return_attention_mask=True, I don't get it back.

If these specific tokenizers (RoBERTa/BERT) don’t support this functionality (which would seem odd), it might be useful to also put that in the documentation.

As a small note, there’s also a typo in the documentation:

return_attention_mask – (optional) Set to False to avoir returning attention mask (default True)

Finally, it seems that pad_to_max_length isn’t padding my input (see the example below). I also tried True instead of an integer, hoping that it would automatically pad up to max seq length in the batch, but to no avail.


from transformers import BertTokenizer

if __name__ == '__main__':
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

    orig_text = ['I like bananas.', 'Yesterday the mailman came by!', 'Do you enjoy cookies?']
    edit_text = ['Do you?', 'He delivered a mystery package.', 'My grandma just baked some!']

    # orig_sents and edit_sents are single sentences (one string each)
    for orig_sents, edit_sents in zip(orig_text, edit_text):
        orig_tokens = tokenizer.tokenize(orig_sents)
        edit_tokens = tokenizer.tokenize(edit_sents)

        seqs = tokenizer.encode_plus(orig_tokens,
                                     edit_tokens,
                                     return_attention_mask=True,
                                     return_tensors='pt',
                                     pad_to_max_length=120)
        print(seqs)

Output:

{'input_ids': tensor([[  101,  1045,  2066, 26191,  1012,   102,  2079,  2017,  1029,   102]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 1, 1, 1, 1]])}
{'input_ids': tensor([[ 101, 7483, 1996, 5653, 2386, 2234, 2011,  999,  102, 2002, 5359, 1037, 6547, 7427, 1012,  102]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]])}
{'input_ids': tensor([[  101,  2079,  2017,  5959, 16324,  1029,   102,  2026, 13055,  2074, 17776,  2070,   999,   102]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]])}
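
In the meantime, a manual workaround is possible, since the returned input_ids themselves look correct. The sketch below (an illustrative helper, not part of the original report) pads the ids with tokenizer.pad_token_id and builds the attention mask by hand:

import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def encode_with_manual_padding(text_a, text_b, max_length=120):
    # Relies only on encode_plus returning correct input_ids / token_type_ids.
    enc = tokenizer.encode_plus(text_a, text_b)
    ids = enc['input_ids']
    type_ids = enc['token_type_ids']
    assert len(ids) <= max_length, 'truncation is not handled in this sketch'
    pad_len = max_length - len(ids)
    attention_mask = [1] * len(ids) + [0] * pad_len   # 1 = real token, 0 = padding
    ids = ids + [tokenizer.pad_token_id] * pad_len    # pad with the [PAD] id
    type_ids = type_ids + [0] * pad_len
    return {'input_ids': torch.tensor([ids]),
            'token_type_ids': torch.tensor([type_ids]),
            'attention_mask': torch.tensor([attention_mask])}

print(encode_with_manual_padding('I like bananas.', 'Do you?'))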

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Reactions: 3
  • Comments: 16 (7 by maintainers)

Top GitHub Comments

5 reactions
Jarvanerp commented, Dec 12, 2019

Hey! For me, setting pad_to_max_length results in an error being thrown. I just tried it with the master branch, but that resulted in the same error. The code I'm executing:

titles = [['allround developer', 'Visual Studio Code'],
 ['allround developer', 'IntelliJ IDEA / PyCharm'],
 ['allround developer', 'Version Control']]
enc_titles = [[tokenizer.encode_plus(title[0], max_length=13, pad_to_max_length=True), tokenizer.encode_plus(title[1], max_length=13, pad_to_max_length=True)] for title in titles]

The error that I am getting:

<ipython-input-213-349f66a39abe> in <module>
      4 # titles = [' '.join(title) for title in titles]
      5 print(titles)
----> 6 enc_titles = [[tokenizer.encode_plus(title[0], max_length=4, pad_to_max_length=True), tokenizer.encode_plus(title[1], max_length=4)] for title in titles]

<ipython-input-213-349f66a39abe> in <listcomp>(.0)
      4 # titles = [' '.join(title) for title in titles]
      5 print(titles)
----> 6 enc_titles = [[tokenizer.encode_plus(title[0], max_length=4, pad_to_max_length=True), tokenizer.encode_plus(title[1], max_length=4)] for title in titles]

/usr/local/lib/python3.7/site-packages/transformers/tokenization_utils.py in encode_plus(self, text, text_pair, add_special_tokens, max_length, stride, truncation_strategy, return_tensors, return_token_type_ids, return_overflowing_tokens, return_special_tokens_mask, **kwargs)
    816                 If there are overflowing tokens, those will be added to the returned dictionary
    817             stride: if set to a number along with max_length, the overflowing tokens returned will contain some tokens
--> 818                 from the main sequence returned. The value of this argument defines the number of additional tokens.
    819             truncation_strategy: string selected in the following options:
    820                 - 'longest_first' (default) Iteratively reduce the inputs sequence until the input is under max_length

/usr/local/lib/python3.7/site-packages/transformers/tokenization_utils.py in get_input_ids(text)
    808                 the `tokenize` method) or a list of integers (tokenized string ids using the `convert_tokens_to_ids`
    809                 method)
--> 810             text_pair: Optional second sequence to be encoded. This can be a string, a list of strings (tokenized
    811                 string using the `tokenize` method) or a list of integers (tokenized string ids using the
    812                 `convert_tokens_to_ids` method)

/usr/local/lib/python3.7/site-packages/transformers/tokenization_utils.py in tokenize(self, text, **kwargs)
    657                 sub_text = sub_text.strip()
    658                 if i == 0 and not sub_text:
--> 659                     result += [tok]
    660                 elif i == len(split_text) - 1:
    661                     if sub_text:

/usr/local/lib/python3.7/site-packages/transformers/tokenization_utils.py in split_on_tokens(tok_list, text)
    654             result = []
    655             split_text = text.split(tok)
--> 656             for i, sub_text in enumerate(split_text):
    657                 sub_text = sub_text.strip()
    658                 if i == 0 and not sub_text:

/usr/local/lib/python3.7/site-packages/transformers/tokenization_utils.py in <genexpr>(.0)
    654             result = []
    655             split_text = text.split(tok)
--> 656             for i, sub_text in enumerate(split_text):
    657                 sub_text = sub_text.strip()
    658                 if i == 0 and not sub_text:

TypeError: _tokenize() got an unexpected keyword argument 'pad_to_max_length'

3 reactions
BramVanroy commented, Dec 18, 2019

Aha, great. I couldn’t wait because I needed it for a shared task, but nice to see it’s taking form. Almost there!
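
For anyone hitting this on a newer release: a minimal sketch assuming transformers 3.0 or later, where the padding API was reworked. padding=True (pad to the longest sequence in the batch) or padding='max_length' replaces pad_to_max_length, and the attention mask is returned by default:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

orig_text = ['I like bananas.', 'Yesterday the mailman came by!', 'Do you enjoy cookies?']
edit_text = ['Do you?', 'He delivered a mystery package.', 'My grandma just baked some!']

# Batch-encode the sentence pairs; padding=True pads to the longest
# sequence in the batch, padding='max_length' pads to max_length instead.
enc = tokenizer(orig_text, edit_text,
                padding=True,
                truncation=True,
                max_length=120,
                return_tensors='pt')

print(enc['input_ids'].shape)    # (3, length of the longest pair in the batch)
print(enc['attention_mask'])     # 1 for real tokens, 0 for [PAD] positions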
