`bert-base-uncased` tokenizer broke around special tokens in v2.2.1
In v2.2.1, the `bert-base-uncased` tokenizer changed in a way that's probably not intentional:
```
Python 3.7.5 (default, Oct 25 2019, 10:52:18)
[Clang 4.0.1 (tags/RELEASE_401/final)] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from transformers.tokenization_auto import AutoTokenizer
To use data.metrics please install scikit-learn. See https://scikit-learn.org/stable/index.html
Neither PyTorch nor TensorFlow >= 2.0 have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
>>> t = AutoTokenizer.from_pretrained("bert-base-uncased"); t.encode_plus(text='A, [MASK] AllenNLP sentence.')
{
    'input_ids': [101, 1037, 1010, 1031, 7308, 1033, 5297, 20554, 2361, 6251, 1012, 102],
    'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
}
```
In v2.2.0:
```
Python 3.7.5 (default, Oct 25 2019, 10:52:18)
[Clang 4.0.1 (tags/RELEASE_401/final)] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from transformers.tokenization_auto import AutoTokenizer
To use data.metrics please install scikit-learn. See https://scikit-learn.org/stable/index.html
Neither PyTorch nor TensorFlow >= 2.0 have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
>>> t = AutoTokenizer.from_pretrained("bert-base-uncased"); t.encode_plus(text='A, [MASK] AllenNLP sentence.')
{
    'special_tokens_mask': [1, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    'input_ids': [101, 1037, 1010, 103, 5297, 20554, 2361, 6251, 1012, 102],
    'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
}
```
(indented the results for clarity)
The key difference is that in v2.2.0 the tokenizer recognizes `[MASK]` as a special token and gives it token id 103. In v2.2.1 this no longer happens: the mask token is split into separate pieces (ids 1031, 7308, 1033 in the output above). The behavior of `bert-base-cased` has not changed, so I don't think this is an intentional change.
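For anyone who wants to check whether their installed version is affected, here is a minimal sketch of a regression test (not part of the original report; it only assumes the public 2.x tokenizer API: `tokenize`, `convert_tokens_to_ids`, `encode_plus`, and the `mask_token` attribute):

```python
from transformers import AutoTokenizer

# Sketch of a regression check. On a correct build, "[MASK]" should survive
# tokenization as a single token and encode to the mask token id
# (103 for bert-base-uncased), rather than being split into "[", "mask", "]".
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

tokens = tokenizer.tokenize("A, [MASK] AllenNLP sentence.")
assert tokenizer.mask_token in tokens, f"[MASK] was split apart: {tokens}"

mask_id = tokenizer.convert_tokens_to_ids(tokenizer.mask_token)
ids = tokenizer.encode_plus(text="A, [MASK] AllenNLP sentence.")["input_ids"]
assert mask_id in ids, f"mask id {mask_id} missing from {ids}"
```

Until a fixed release is out, pinning to v2.2.0 (where the behavior is correct) or installing from `master` avoids the problem.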
I screwed up. It is fixed in `master` after all.

Good to hear! We'll push a new release soon, cc @LysandreJik