
When I encode [unused1], it does not return a single token

See original GitHub issue

🐛 Bug

Information

Model I am using (Bert, XLNet …): bert

Language I am using the model on (English, Chinese …): English

The problem arises when using: tokenizer.encode('[unused1]')

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is: relation extraction

  • an official GLUE/SQUaD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

  1. tokenizer.encode("[unused1]")
  2. The call returns more than one token; with keras-bert, the same input returns a single token.
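
The multi-token result comes from BERT's basic tokenization step, which splits the input on punctuation characters before wordpiece tokenization runs, so "[unused1]" is broken into "[", "unused1", and "]". Here is a rough, pure-Python illustration of that splitting behavior (not the actual transformers implementation; the function name is made up for this sketch):

```python
import re

def split_on_punctuation(text):
    # Split out each punctuation character as its own token,
    # loosely mimicking BertTokenizer's basic-tokenization step.
    return re.findall(r"\w+|[^\w\s]", text)

print(split_on_punctuation("[unused1]"))  # ['[', 'unused1', ']']
```

Because the brackets are split off first, the wordpiece vocabulary lookup never sees the literal string "[unused1]".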

Expected behavior

Encoding "[unused1]" should return a single token id, as keras-bert does.

Environment info

  • transformers version: latest version
  • Platform:
  • Python version: 3.7
  • PyTorch version (GPU?): 1.1.0
  • Tensorflow version (GPU?):
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

18 reactions
n1t0 commented, Jun 3, 2020

Hi @jxyxiangyu! Thank you @BramVanroy & @mfuntowicz for the help on this!

I think in this case the easiest way to handle this is to add the tokens you plan to use as special tokens. After all, that’s what they are. They are not added by default, since only a handful of them are actually used, so you need to do it manually with

tokenizer.add_special_tokens({ "additional_special_tokens": [ "[unused1]" ] })

Then, it should work for both fast and slow tokenizers:

>>> from transformers import AutoTokenizer

>>> slow = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=False)
>>> fast = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=True)

>>> slow.add_special_tokens({ "additional_special_tokens": [ "[unused1]" ] })
>>> fast.add_special_tokens({ "additional_special_tokens": [ "[unused1]" ] })

>>> slow.encode("[unused1]", add_special_tokens=False)
[1]
>>> fast.encode("[unused1]", add_special_tokens=False)
[1]
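
Conceptually, registering special tokens works by matching them verbatim before any other splitting happens, so they survive tokenization intact. A minimal sketch of that idea in plain Python (the helper below is hypothetical, not the real tokenizers internals):

```python
import re

def tokenize_with_specials(text, special_tokens):
    # Split the text around exact special-token matches first;
    # only the remaining spans go through the normal
    # punctuation-splitting rules.
    pattern = "(" + "|".join(re.escape(t) for t in special_tokens) + ")"
    tokens = []
    for piece in re.split(pattern, text):
        if piece in special_tokens:
            tokens.append(piece)  # kept whole, never split
        elif piece:
            tokens.extend(re.findall(r"\w+|[^\w\s]", piece))
    return tokens

print(tokenize_with_specials("hello [unused1]!", {"[unused1]"}))
# ['hello', '[unused1]', '!']
```

This mirrors why add_special_tokens fixes the issue for both the fast and slow tokenizers: once "[unused1]" is registered, it is matched as an atomic unit before basic tokenization can split it on the brackets.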
3 reactions
mfuntowicz commented, Jun 2, 2020

Hi @jxyxiangyu, thanks for reporting this, and thanks @BramVanroy for putting together code to reproduce it.

So far, the behavior you want to achieve requires deactivating the do_basic_tokenize feature on BertTokenizer; otherwise the input is split on punctuation characters before it actually goes through the wordpiece tokenizer.

I don’t think we have an equivalent on the Rust implementation of Bert, let me check internally and get back to you on this point.

Here is a snippet of code that should achieve the desired behavior:

>>> from transformers import BertTokenizer

>>> tokenizer = BertTokenizer.from_pretrained("bert-base-cased", do_basic_tokenize=False)
>>> tokenizer.tokenize("[unused1]")
['[unused1]']
>>> tokenizer.encode("[unused1]", add_special_tokens=False)
[1]
>>> tokenizer.decode([1])
'[unused1]'