`bert-base-uncased` tokenizer broke around special tokens in v2.2.1
In v2.2.1, the `bert-base-uncased` tokenizer changed in a way that's probably not intentional:
```
Python 3.7.5 (default, Oct 25 2019, 10:52:18)
[Clang 4.0.1 (tags/RELEASE_401/final)] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from transformers.tokenization_auto import AutoTokenizer
To use data.metrics please install scikit-learn. See https://scikit-learn.org/stable/index.html
Neither PyTorch nor TensorFlow >= 2.0 have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
>>> t = AutoTokenizer.from_pretrained("bert-base-uncased"); t.encode_plus(text='A, [MASK] AllenNLP sentence.')
{
    'input_ids': [101, 1037, 1010, 1031, 7308, 1033, 5297, 20554, 2361, 6251, 1012, 102],
    'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
}
```
In v2.2.0:
```
Python 3.7.5 (default, Oct 25 2019, 10:52:18)
[Clang 4.0.1 (tags/RELEASE_401/final)] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from transformers.tokenization_auto import AutoTokenizer
To use data.metrics please install scikit-learn. See https://scikit-learn.org/stable/index.html
Neither PyTorch nor TensorFlow >= 2.0 have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
>>> t = AutoTokenizer.from_pretrained("bert-base-uncased"); t.encode_plus(text='A, [MASK] AllenNLP sentence.')
{
    'special_tokens_mask': [1, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    'input_ids': [101, 1037, 1010, 103, 5297, 20554, 2361, 6251, 1012, 102],
    'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
}
```
(indented the results for clarity)
The key difference is that in v2.2.0 the tokenizer recognizes `[MASK]` as a special token and gives it token id 103. In v2.2.1 this no longer happens: the mask token is split into separate pieces (ids 1031, 7308, 1033 in the output above). The behavior of `bert-base-cased` has not changed, so I don't think this is an intentional change.
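For anyone who wants to check whether their installed version is affected, here is a minimal sketch of a regression test (not part of the original report; it only assumes the public 2.x tokenizer API: `tokenize`, `convert_tokens_to_ids`, `encode_plus`, and the `mask_token` attribute):

```python
from transformers import AutoTokenizer

# Sketch of a regression check. On a correct build, "[MASK]" should survive
# tokenization as a single token and encode to the mask token id
# (103 for bert-base-uncased), rather than being split into "[", "mask", "]".
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

tokens = tokenizer.tokenize("A, [MASK] AllenNLP sentence.")
assert tokenizer.mask_token in tokens, f"[MASK] was split apart: {tokens}"

mask_id = tokenizer.convert_tokens_to_ids(tokenizer.mask_token)
ids = tokenizer.encode_plus(text="A, [MASK] AllenNLP sentence.")["input_ids"]
assert mask_id in ids, f"mask id {mask_id} missing from {ids}"
```

Until a fixed release is out, pinning to v2.2.0 (where the behavior is correct) or installing from `master` avoids the problem.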
I screwed up. It is fixed in `master` after all.

Good to hear! We'll push a new release soon, cc @LysandreJik