
When I encode [unused1], it does not return a single token

See original GitHub issue

🐛 Bug

Information

Model I am using (Bert, XLNet …): bert

Language I am using the model on (English, Chinese …): English

The problem arises when using: tokenizer.encode('[unused1]')

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is: relation extraction

  • an official GLUE/SQUaD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

  1. tokenizer.encode("[unused1]")
  2. The call returns more than one token; with keras-bert, the same input returns a single token.
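
The multi-token result comes from BERT's basic tokenization step, which splits the input on punctuation characters before wordpiece tokenization runs, so "[unused1]" is broken into "[", "unused1", and "]". Here is a rough, pure-Python illustration of that splitting behavior (not the actual transformers implementation; the function name is made up for this sketch):

```python
import re

def split_on_punctuation(text):
    # Split out each punctuation character as its own token,
    # loosely mimicking BertTokenizer's basic-tokenization step.
    return re.findall(r"\w+|[^\w\s]", text)

print(split_on_punctuation("[unused1]"))  # ['[', 'unused1', ']']
```

Because the brackets are split off first, the wordpiece vocabulary lookup never sees the literal string "[unused1]".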

Expected behavior

Encoding "[unused1]" should return a single token id, as keras-bert does.

Environment info

  • transformers version: latest version
  • Platform:
  • Python version: 3.7
  • PyTorch version (GPU?): 1.1.0
  • Tensorflow version (GPU?):
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

18 reactions
n1t0 commented, Jun 3, 2020

Hi @jxyxiangyu! Thank you @BramVanroy & @mfuntowicz for the help on this!

I think in this case the easiest way to handle this is to add the tokens you plan to use as special tokens. After all, that’s what they are. They are not added by default, since only a handful of them are actually used, so you need to do it manually with

tokenizer.add_special_tokens({ "additional_special_tokens": [ "[unused1]" ] })

Then, it should work for both fast and slow tokenizers:

>>> from transformers import AutoTokenizer

>>> slow = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=False)
>>> fast = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=True)

>>> slow.add_special_tokens({ "additional_special_tokens": [ "[unused1]" ] })
>>> fast.add_special_tokens({ "additional_special_tokens": [ "[unused1]" ] })

>>> slow.encode("[unused1]", add_special_tokens=False)
[1]
>>> fast.encode("[unused1]", add_special_tokens=False)
[1]
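
Conceptually, registering special tokens works by matching them verbatim before any other splitting happens, so they survive tokenization intact. A minimal sketch of that idea in plain Python (the helper below is hypothetical, not the real tokenizers internals):

```python
import re

def tokenize_with_specials(text, special_tokens):
    # Split the text around exact special-token matches first;
    # only the remaining spans go through the normal
    # punctuation-splitting rules.
    pattern = "(" + "|".join(re.escape(t) for t in special_tokens) + ")"
    tokens = []
    for piece in re.split(pattern, text):
        if piece in special_tokens:
            tokens.append(piece)  # kept whole, never split
        elif piece:
            tokens.extend(re.findall(r"\w+|[^\w\s]", piece))
    return tokens

print(tokenize_with_specials("hello [unused1]!", {"[unused1]"}))
# ['hello', '[unused1]', '!']
```

This mirrors why add_special_tokens fixes the issue for both the fast and slow tokenizers: once "[unused1]" is registered, it is matched as an atomic unit before basic tokenization can split it on the brackets.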
3 reactions
mfuntowicz commented, Jun 2, 2020

Hi @jxyxiangyu, thanks for reporting this, and thanks @BramVanroy for putting together code to reproduce it.

So far, the behavior you want to achieve requires deactivating the do_basic_tokenize feature on BertTokenizer; otherwise the input is split on punctuation characters before it actually goes through the wordpiece tokenizer.

I don’t think we have an equivalent on the Rust implementation of Bert, let me check internally and get back to you on this point.

Here is a snippet of code that should achieve the desired behavior:

>>> from transformers import BertTokenizer

>>> tokenizer = BertTokenizer.from_pretrained("bert-base-cased", do_basic_tokenize=False)
>>> tokenizer.tokenize("[unused1]")
['[unused1]']
>>> tokenizer.encode("[unused1]", add_special_tokens=False)
[1]
>>> tokenizer.decode([1])
'[unused1]'