Tokenizers: setting bos_token_id = 0 and adding language_pair_codes
I am unable to set `bos_token_id=0` for a new SentencePiece tokenizer (MBART). Here is what I'm doing:
```
wget https://s3.amazonaws.com/models.huggingface.co/bert/facebook/mbart-large-en-ro/sentence.bpe.model
```

```python
from transformers import T5Tokenizer

vocab_file = 'sentence.bpe.model'
t2 = T5Tokenizer(vocab_file, bos_token='<s>', bos_token_id=0)
t2.bos_token_id  # => 1
```
The following also returns 1:

```python
t2 = T5Tokenizer(vocab_file, bos_token='<s>', bos_token_id=0,
                 additional_special_tokens=['<s>'])
t2.bos_token_id
```
Help much appreciated!
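For what it's worth, the 1 can be traced back to the `.bpe.model` itself: `bos_token_id` is not a constructor argument that transformers honors; it is resolved by looking `bos_token` up in the vocab, so the id comes straight from the SentencePiece model. A quick diagnostic sketch (assuming the `sentence.bpe.model` downloaded above and the `sentencepiece` package):

```python
import sentencepiece as spm

# Load the raw SentencePiece model that the tokenizer wraps.
sp = spm.SentencePieceProcessor()
sp.Load('sentence.bpe.model')

# SentencePiece's default special-token layout is <unk>=0, <s>=1, </s>=2,
# so '<s>' resolves to 1 -- the value t2.bos_token_id reports above.
print(sp.PieceToId('<s>'))  # => 1
```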
Yes, you can check how we handle this token-index offset (it's specific to fairseq + SentencePiece) in the CamemBERT and XLM-RoBERTa tokenizers.
For posterity, I think Thomas means this:
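The exact snippet isn't preserved here, but the pattern in question is presumably the fairseq id remapping in `XLMRobertaTokenizer`/`CamembertTokenizer`. A simplified sketch of that pattern follows (a reconstruction, not the verbatim library code): fairseq pins `<s>=0, <pad>=1, </s>=2, <unk>=3`, while a raw SentencePiece model reserves `<unk>=0, <s>=1, </s>=2`, so the special tokens get fixed ids and every ordinary SentencePiece id is shifted by an offset of 1.

```python
import sentencepiece as spm

class FairseqOffsetTokenizer:
    """Sketch of the fairseq/SentencePiece id-offset trick (illustrative only)."""

    # fairseq's fixed special-token ids.
    fairseq_tokens_to_ids = {'<s>': 0, '<pad>': 1, '</s>': 2, '<unk>': 3}
    # fairseq's vocab has one extra leading token (<pad>), hence the shift.
    fairseq_offset = 1

    def __init__(self, vocab_file):
        self.sp_model = spm.SentencePieceProcessor()
        self.sp_model.Load(vocab_file)

    def convert_token_to_id(self, token):
        # Special tokens get the fixed fairseq ids, so '<s>' maps to 0.
        if token in self.fairseq_tokens_to_ids:
            return self.fairseq_tokens_to_ids[token]
        spm_id = self.sp_model.PieceToId(token)
        # PieceToId returns 0 for out-of-vocab pieces; map those to <unk>.
        return spm_id + self.fairseq_offset if spm_id else self.fairseq_tokens_to_ids['<unk>']

t = FairseqOffsetTokenizer('sentence.bpe.model')
print(t.convert_token_to_id('<s>'))  # => 0, the desired bos_token_id
```

The language-pair codes from the title ended up being handled the same way in `MBartTokenizer`: the `en_XX`/`ro_RO`-style codes are appended as extra entries after the offset SentencePiece vocab.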