When I encode [unused1], the tokenizer does not return a single token
🐛 Bug
Information
Model I am using (Bert, XLNet …): BERT
Language I am using the model on (English, Chinese …): English
The problem arises when using: `tokenizer.encode('[unused1]')`
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is: relation extraction
- an official GLUE/SQuAD task: (give the name)
- my own task or dataset: (give details below)
To reproduce
Steps to reproduce the behavior:
- call `tokenizer.encode("[unused1]")`
- more than one token comes back; with keras-bert, the same input yields a single token (see the sketch below)
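A minimal reproduction sketch (assuming the stock `bert-base-uncased` checkpoint, which the issue does not name explicitly):

```python
from transformers import BertTokenizer

# Assumption: bert-base-uncased; the issue does not name the checkpoint.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
ids = tokenizer.encode("[unused1]")
# Several ids come back (besides [CLS]/[SEP]) instead of the single
# id for [unused1] that keras-bert returns.
print(ids)
```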
Expected behavior
`tokenizer.encode("[unused1]")` should map `[unused1]` to a single token id (besides the special tokens), as keras-bert does.
Environment info
- transformers version: latest
- Platform:
- Python version: 3.7
- PyTorch version (GPU?): 1.1.0
- Tensorflow version (GPU?):
- Using GPU in script?:
- Using distributed or parallel set-up in script?:
Top GitHub Comments
Hi @jxyxiangyu! Thank you @BramVanroy & @mfuntowicz for the help on this!
I think in this case the easiest way to handle this is by adding the tokens you plan to use as special tokens. After all, that's what they are. They are not added by default, since only a handful of them are actually used, so you need to do it manually.
Then, it should work for both fast and slow tokenizers:
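The snippet from the original comment was not preserved in this capture; here is a minimal sketch of the approach, assuming `bert-base-uncased`:

```python
from transformers import BertTokenizer, BertTokenizerFast

# Sketch only (the original snippet was lost). Assumes the stock
# bert-base-uncased vocabulary, which already contains [unused1].
for tokenizer_class in (BertTokenizer, BertTokenizerFast):
    tokenizer = tokenizer_class.from_pretrained("bert-base-uncased")
    # Register [unused1] as an additional special token so the basic
    # tokenizer no longer splits it on the bracket punctuation.
    tokenizer.add_special_tokens({"additional_special_tokens": ["[unused1]"]})
    print(tokenizer.tokenize("[unused1]"))  # expected: ['[unused1]']
```

Because `[unused1]` is already in the vocabulary, `add_special_tokens` does not grow the vocab here; it only marks the token so pre-tokenization leaves it intact, and no embedding resize is needed.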
Hi @jxyxiangyu, thanks for reporting this, and thanks @BramVanroy for writing code to reproduce it.
So far, the behavior you want to achieve needs to be done by deactivating the `do_basic_tokenize` feature on `BertTokenizer`; otherwise the input will be split on punctuation characters before actually going through the WordPiece tokenizer. I don't think we have an equivalent in the Rust implementation of BERT; let me check internally and get back to you on this point.
Here is a snippet of code which should achieve the desired behavior:
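The snippet itself was also lost in this capture; a minimal sketch under the same `bert-base-uncased` assumption:

```python
from transformers import BertTokenizer

# Sketch only (the original snippet was lost). do_basic_tokenize=False
# skips the punctuation splitting that would otherwise break
# "[unused1]" into "[", "unused1", "]" before WordPiece runs.
tokenizer = BertTokenizer.from_pretrained(
    "bert-base-uncased", do_basic_tokenize=False
)
print(tokenizer.tokenize("[unused1]"))  # expected: ['[unused1]']
```

Note that disabling the basic tokenizer also turns off lowercasing and punctuation splitting for all other input, so regular text will tokenize differently; the special-tokens approach above is generally the safer option.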