encode_plus not returning attention_mask and not padding

See original GitHub issue

šŸ› Bug

Tested on RoBERTa and BERT from the master branch: the tokenizer's encode_plus method does not return an attention mask. The documentation states that an attention_mask is returned by default, but I only get back the input_ids and the token_type_ids. Even when explicitly specifying return_attention_mask=True, I don't get it back.

If these specific tokenizers (RoBERTa/BERT) don’t support this functionality (which would seem odd), it might be useful to also put that in the documentation.

As a small note, there’s also a typo in the documentation:

return_attention_mask – (optional) Set to False to avoir returning attention mask (default True)

Finally, it seems that pad_to_max_length isn’t padding my input (see the example below). I also tried True instead of an integer, hoping that it would automatically pad up to max seq length in the batch, but to no avail.


from transformers import BertTokenizer

if __name__ == '__main__':
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

    orig_text = ['I like bananas.', 'Yesterday the mailman came by!', 'Do you enjoy cookies?']
    edit_text = ['Do you?', 'He delivered a mystery package.', 'My grandma just baked some!']

    # orig_sents and edit_sents are single sentences (one string each)
    for orig_sents, edit_sents in zip(orig_text, edit_text):
        orig_tokens = tokenizer.tokenize(orig_sents)
        edit_tokens = tokenizer.tokenize(edit_sents)

        seqs = tokenizer.encode_plus(orig_tokens,
                                     edit_tokens,
                                     return_attention_mask=True,
                                     return_tensors='pt',
                                     pad_to_max_length=120)
        print(seqs)

Output:

{'input_ids': tensor([[  101,  1045,  2066, 26191,  1012,   102,  2079,  2017,  1029,   102]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 1, 1, 1, 1]])}
{'input_ids': tensor([[ 101, 7483, 1996, 5653, 2386, 2234, 2011,  999,  102, 2002, 5359, 1037, 6547, 7427, 1012,  102]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]])}
{'input_ids': tensor([[  101,  2079,  2017,  5959, 16324,  1029,   102,  2026, 13055,  2074, 17776,  2070,   999,   102]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]])}
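
In the meantime, a manual workaround is possible, since the returned input_ids themselves look correct. The sketch below (an illustrative helper, not part of the original report) pads the ids with tokenizer.pad_token_id and builds the attention mask by hand:

import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def encode_with_manual_padding(text_a, text_b, max_length=120):
    # Relies only on encode_plus returning correct input_ids / token_type_ids.
    enc = tokenizer.encode_plus(text_a, text_b)
    ids = enc['input_ids']
    type_ids = enc['token_type_ids']
    assert len(ids) <= max_length, 'truncation is not handled in this sketch'
    pad_len = max_length - len(ids)
    attention_mask = [1] * len(ids) + [0] * pad_len   # 1 = real token, 0 = padding
    ids = ids + [tokenizer.pad_token_id] * pad_len    # pad with the [PAD] id
    type_ids = type_ids + [0] * pad_len
    return {'input_ids': torch.tensor([ids]),
            'token_type_ids': torch.tensor([type_ids]),
            'attention_mask': torch.tensor([attention_mask])}

print(encode_with_manual_padding('I like bananas.', 'Do you?'))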

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Reactions: 3
  • Comments: 16 (7 by maintainers)

Top GitHub Comments

5 reactions
Jarvanerp commented, Dec 12, 2019

Hey! For me, setting pad_to_max_length results in an error being thrown. I just tried it with the master branch, but that resulted in the same error. The code I'm executing:

titles = [['allround developer', 'Visual Studio Code'],
 ['allround developer', 'IntelliJ IDEA / PyCharm'],
 ['allround developer', 'Version Control']]
enc_titles = [[tokenizer.encode_plus(title[0], max_length=13, pad_to_max_length=True), tokenizer.encode_plus(title[1], max_length=13, pad_to_max_length=True)] for title in titles]

The error that I am getting:

<ipython-input-213-349f66a39abe> in <module>
      4 # titles = [' '.join(title) for title in titles]
      5 print(titles)
----> 6 enc_titles = [[tokenizer.encode_plus(title[0], max_length=4, pad_to_max_length=True), tokenizer.encode_plus(title[1], max_length=4)] for title in titles]

<ipython-input-213-349f66a39abe> in <listcomp>(.0)
      4 # titles = [' '.join(title) for title in titles]
      5 print(titles)
----> 6 enc_titles = [[tokenizer.encode_plus(title[0], max_length=4, pad_to_max_length=True), tokenizer.encode_plus(title[1], max_length=4)] for title in titles]

/usr/local/lib/python3.7/site-packages/transformers/tokenization_utils.py in encode_plus(self, text, text_pair, add_special_tokens, max_length, stride, truncation_strategy, return_tensors, return_token_type_ids, return_overflowing_tokens, return_special_tokens_mask, **kwargs)
    816                 If there are overflowing tokens, those will be added to the returned dictionary
    817             stride: if set to a number along with max_length, the overflowing tokens returned will contain some tokens
--> 818                 from the main sequence returned. The value of this argument defines the number of additional tokens.
    819             truncation_strategy: string selected in the following options:
    820                 - 'longest_first' (default) Iteratively reduce the inputs sequence until the input is under max_length

/usr/local/lib/python3.7/site-packages/transformers/tokenization_utils.py in get_input_ids(text)
    808                 the `tokenize` method) or a list of integers (tokenized string ids using the `convert_tokens_to_ids`
    809                 method)
--> 810             text_pair: Optional second sequence to be encoded. This can be a string, a list of strings (tokenized
    811                 string using the `tokenize` method) or a list of integers (tokenized string ids using the
    812                 `convert_tokens_to_ids` method)

/usr/local/lib/python3.7/site-packages/transformers/tokenization_utils.py in tokenize(self, text, **kwargs)
    657                 sub_text = sub_text.strip()
    658                 if i == 0 and not sub_text:
--> 659                     result += [tok]
    660                 elif i == len(split_text) - 1:
    661                     if sub_text:

/usr/local/lib/python3.7/site-packages/transformers/tokenization_utils.py in split_on_tokens(tok_list, text)
    654             result = []
    655             split_text = text.split(tok)
--> 656             for i, sub_text in enumerate(split_text):
    657                 sub_text = sub_text.strip()
    658                 if i == 0 and not sub_text:

/usr/local/lib/python3.7/site-packages/transformers/tokenization_utils.py in <genexpr>(.0)
    654             result = []
    655             split_text = text.split(tok)
--> 656             for i, sub_text in enumerate(split_text):
    657                 sub_text = sub_text.strip()
    658                 if i == 0 and not sub_text:

TypeError: _tokenize() got an unexpected keyword argument 'pad_to_max_length'

3 reactions
BramVanroy commented, Dec 18, 2019

Aha, great. I couldn’t wait because I needed it for a shared task, but nice to see it’s taking form. Almost there!
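
For anyone hitting this on a newer release: a minimal sketch assuming transformers 3.0 or later, where the padding API was reworked. padding=True (pad to the longest sequence in the batch) or padding='max_length' replaces pad_to_max_length, and the attention mask is returned by default:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

orig_text = ['I like bananas.', 'Yesterday the mailman came by!', 'Do you enjoy cookies?']
edit_text = ['Do you?', 'He delivered a mystery package.', 'My grandma just baked some!']

# Batch-encode the sentence pairs; padding=True pads to the longest
# sequence in the batch, padding='max_length' pads to max_length instead.
enc = tokenizer(orig_text, edit_text,
                padding=True,
                truncation=True,
                max_length=120,
                return_tensors='pt')

print(enc['input_ids'].shape)    # (3, length of the longest pair in the batch)
print(enc['attention_mask'])     # 1 for real tokens, 0 for [PAD] positions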
