
`bert-base-uncased` tokenizer broke around special tokens in v2.2.1

See original GitHub issue

In v2.2.1, the `bert-base-uncased` tokenizer changed in a way that’s probably not intentional:

Python 3.7.5 (default, Oct 25 2019, 10:52:18)
[Clang 4.0.1 (tags/RELEASE_401/final)] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from transformers.tokenization_auto import AutoTokenizer
To use data.metrics please install scikit-learn. See https://scikit-learn.org/stable/index.html
Neither PyTorch nor TensorFlow >= 2.0 have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
>>> t = AutoTokenizer.from_pretrained("bert-base-uncased"); t.encode_plus(text='A, [MASK] AllenNLP sentence.')
{
    'input_ids': [101, 1037, 1010, 1031, 7308, 1033, 5297, 20554, 2361, 6251, 1012, 102],
    'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
}

In v2.2.0:

Python 3.7.5 (default, Oct 25 2019, 10:52:18)
[Clang 4.0.1 (tags/RELEASE_401/final)] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from transformers.tokenization_auto import AutoTokenizer
To use data.metrics please install scikit-learn. See https://scikit-learn.org/stable/index.html
Neither PyTorch nor TensorFlow >= 2.0 have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
>>> t = AutoTokenizer.from_pretrained("bert-base-uncased"); t.encode_plus(text='A, [MASK] AllenNLP sentence.')
{
    'special_tokens_mask': [1, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    'input_ids': [101, 1037, 1010, 103, 5297, 20554, 2361, 6251, 1012, 102],
    'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
}

(indented the results for clarity)

The key difference is that in v2.2.0, the tokenizer recognizes [MASK] as a special token and maps it to the single token id 103. In v2.2.1 this no longer happens: [MASK] is split into the ordinary tokens `[` (1031), `mask` (7308), and `]` (1033), and the `special_tokens_mask` entry disappears from the output. The behavior of `bert-base-cased` has not changed, so I don’t think this is an intentional change.
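To illustrate the mechanism behind the regression, here is a minimal, self-contained sketch (not the actual transformers implementation) of why special-token matching matters: if the input is split on special tokens before ordinary tokenization, `[MASK]` stays atomic and maps to one id; if that step is skipped, the brackets are peeled off and tokenized separately. The `VOCAB` fragment is hypothetical, but the ids mirror the `bert-base-uncased` entries seen in the transcripts above.

```python
import re

SPECIAL_TOKENS = {"[CLS]": 101, "[SEP]": 102, "[MASK]": 103}
# Hypothetical fragment of the lowercased WordPiece vocab, for illustration only.
VOCAB = {"[": 1031, "]": 1033, "mask": 7308, "a": 1037, ",": 1010}

def basic_tokenize(text):
    # Crude stand-in for BERT's basic tokenizer: lowercase, split off punctuation.
    return re.findall(r"\w+|[^\w\s]", text.lower())

def encode(text, respect_special_tokens=True):
    ids = [SPECIAL_TOKENS["[CLS]"]]
    if respect_special_tokens:
        # Split on special tokens first so they stay atomic (v2.2.0 behavior).
        pattern = "(" + "|".join(re.escape(t) for t in SPECIAL_TOKENS) + ")"
        chunks = re.split(pattern, text)
    else:
        # The v2.2.1 regression: special tokens are never matched as units.
        chunks = [text]
    for chunk in chunks:
        if chunk in SPECIAL_TOKENS:
            ids.append(SPECIAL_TOKENS[chunk])
        else:
            ids.extend(VOCAB[tok] for tok in basic_tokenize(chunk) if tok in VOCAB)
    ids.append(SPECIAL_TOKENS["[SEP]"])
    return ids

print(encode("A, [MASK]"))
# → [101, 1037, 1010, 103, 102]          ([MASK] kept whole, like v2.2.0)
print(encode("A, [MASK]", respect_special_tokens=False))
# → [101, 1037, 1010, 1031, 7308, 1033, 102]  ('[', 'mask', ']', like v2.2.1)
```

The second call reproduces the `1031, 7308, 1033` run visible in the v2.2.1 `input_ids` above, which is how you can tell the brackets are being tokenized literally.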

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

1 reaction
dirkgr commented, Dec 11, 2019

I screwed up. It is fixed in master after all.

0 reactions
julien-c commented, Dec 11, 2019

Good to hear! We’ll push a new release soon, cc @LysandreJik

